Abstract: Privacy preserving data publishing has attracted
considerable research interest in recent years. Among the existing
solutions, ε-differential privacy provides one of the strongest
privacy guarantees. Existing data publishing methods that achieve
ε-differential privacy, however, offer little data utility. In particular,
if the output dataset is used to answer count queries, the noise in
the query answers can be proportional to the number of tuples
in the data, which renders the results useless.

In this paper, we develop a data publishing technique that 
ensures ε-differential privacy while providing accurate answers
for range-count queries, i.e., count queries where the predicate on 
each attribute is a range. The core of our solution is a framework 
that applies wavelet transforms on the data before adding noise 
to it. We present instantiations of the proposed framework for 
both ordinal and nominal data, and we provide a theoretical 
analysis on their privacy and utility guarantees. In an extensive 
experimental study on both real and synthetic data, we show the 
effectiveness and efficiency of our solution. 

I. Introduction 

The boisterous sea of liberty is never without a wave. — 
Thomas Jefferson. 

Numerous organizations, like census bureaus and hospitals, 
maintain large collections of personal information (e.g., cen- 
sus data and medical records). Such data collections are of 
significant research value, and there is much benefit in making 
them publicly available. Nevertheless, as the data is sensitive 
in nature, proper measures must be taken to ensure that its 
publication does not endanger the privacy of the individuals 
that contributed the data. A canonical solution to this problem 
is to modify the data before releasing it to the public, such 
that the modification prevents inference of private information 
while retaining statistical characteristics of the data. 

A plethora of techniques have been proposed for privacy 
preserving data publishing (see [1], [2] for surveys). Existing 
solutions make different assumptions about the background 
knowledge of an adversary who would like to attack the 
data — i.e., to learn the private information about some 
individuals. Assumptions about the background knowledge of 
the adversary determine what types of attacks are possible 
[3]-[5]. A solution that makes very conservative assumptions 
about the adversary's background knowledge is ε-differential
privacy [6]. Informally, ε-differential privacy requires that the
data to be published should be generated using a randomized 
algorithm Q, such that the output of Q is not very sensitive to 
any particular tuple in the input, i.e., the output of Q should 
rely mainly on general properties of the data. This ensures 
that, by observing the data modified by Q, the adversary is
not able to infer much information about any individual tuple
in the input data, and hence, privacy is preserved.

The simplest method to enforce ε-differential privacy, as
proposed by Dwork et al. [6], is to first derive the frequency 
distribution of the tuples in the input data, and then publish 
a noisy version of the distribution. For example, given the 
medical records in Table I, Dwork et al.'s method first maps
the records to the frequency matrix in Table II, where each
entry in the first (second) column stores the number of diabetes
(non-diabetes) patients in Table I that belong to a specific age
group. After that, Dwork et al.'s method adds an independent
noise¹ with a Θ(1) variance to each entry in Table II (we will
review this in detail in Section II-B), and then publishes the
noisy frequency matrix.

Intuitively, the noisy frequency matrix preserves privacy, as 
it conceals the exact data distribution. In addition, the matrix 
can provide approximate results for any queries about Table I.
For instance, if a user wants to know the number of diabetes 
patients with age under 50, then she can obtain an approximate 
answer by summing up the first three entries in the first column 
of the noisy frequency matrix. 

Motivation. Dwork et al.'s method provides reasonable ac- 
curacy for queries about individual entries in the frequency 
matrix, as it injects only a small noise (with a constant 
variance) into each entry. For aggregate queries that involve 
a large number of entries, however, Dwork et al.'s method 
fails to provide useful results. In particular, for a count query 
answered by taking the sum of a constant fraction of the entries 
in the noisy frequency matrix, the approximate query result has 
a Θ(m) noise variance, where m denotes the total number of
entries in the matrix. Note that m is typically an enormous
number, as practical datasets often contain multiple attributes
with large domains. Hence, a Θ(m) noise variance can render
the approximate result meaningless, especially when the actual
result of the query is small.

1 Throughout the paper, we use the term "noise" to refer to a random variable
with a zero mean.

Our Contributions. In this paper, we introduce Privelet
(privacy preserving wavelet), a data publishing technique
that not only ensures ε-differential privacy, but also provides
accurate results for all range-count queries, i.e., count queries
where the predicate on each attribute is a range. Specifically,
Privelet guarantees that any range-count query can be
answered with a noise whose variance is polylogarithmic in
m. This significantly improves over the Θ(m) noise variance
bound provided by Dwork et al.'s method.

The effectiveness of Privelet results from a novel application 
of wavelet transforms, a type of linear transformation that has
been widely adopted for image processing [7] and approximate 
query processing [8]. As with Dwork et al.'s method, Privelet 
preserves privacy by modifying the frequency matrix M of the 
input data. Instead of injecting noise directly into M, however, 
Privelet first applies a wavelet transform on M, converting M 
to another matrix C. Privelet then adds a polylogarithmic noise 
to each entry in C, and maps C back to a noisy frequency 
matrix M*. The matrix M* thus obtained has an interesting
property: The result of any range-count query on M* can be 
expressed as a weighted sum of a polylogarithmic number of 
entries in C. Furthermore, each of these entries contributes 
at most polylogarithmic noise variance to the weighted sum. 
Therefore, the variance of the noise in the query result is 
bounded by a polylogarithm of m.

The remainder of the paper is organized as follows. Section II
gives a formal problem definition and reviews Dwork et
al.'s solution. In Section III, we present the Privelet framework
for incorporating wavelet transforms in data publishing, and
we establish a sufficient condition for achieving ε-differential
privacy under the framework. We then instantiate the framework
with three different wavelet transforms. Our first
instantiation in Section IV is based on the Haar wavelet
transform [7], and is applicable for one-dimensional ordinal
data. Our second instantiation in Section V is based on a novel
nominal wavelet transform, which is designed for tables with
a single nominal attribute. Our third instantiation in Section
VI is a composition of the first two and can handle multi-dimensional
data with both ordinal and nominal attributes.
We conduct a rigorous analysis on the properties of each
instantiation, and provide theoretical bounds on privacy and
utility guarantees, as well as time complexities. In Section VII,
we demonstrate the effectiveness and efficiency of Privelet
through extensive experiments on both real and synthetic
data. Section VIII discusses related work. In Section IX, we
conclude with directions for future work.

II. Preliminaries 

A. Problem Definition 

Consider that we want to publish a relational table T that
contains d attributes A_1, A_2, ..., A_d, each of which is either
ordinal (i.e., discrete and ordered) or nominal (i.e., discrete and
unordered). Following previous work [9], [10], we assume that
each nominal attribute A_i in T has an associated hierarchy,
which is a tree where (i) each leaf is a value in the domain
of A_i, and (ii) each internal node summarizes the leaves in
its subtree. Figure 1 shows an example hierarchy of countries.
We define n as the number of tuples in T, and m as the size
of the multi-dimensional domain on which T is defined, i.e.,

m = \prod_{i=1}^{d} |A_i|.

Fig. 1. A Hierarchy of Countries

We aim to release T using an algorithm that ensures
ε-differential privacy.

Definition 1 (ε-Differential Privacy [6]): A randomized
algorithm Q satisfies ε-differential privacy, if and only if (i) for
any two tables T_1 and T_2 that differ only in one tuple, and (ii)
for any output O of Q, we have

\Pr\{Q(T_1) = O\} \le e^{\epsilon} \cdot \Pr\{Q(T_2) = O\}. ■

We optimize the utility of the released data for OLAP-style 
range-count queries in the following form: 

SELECT COUNT(*) FROM T
WHERE A_1 ∈ S_1 AND A_2 ∈ S_2 AND ... AND A_d ∈ S_d

For each ordinal attribute A_i, S_i is an interval defined on
the domain of A_i. If A_i is nominal, S_i is a set that contains
either (i) a leaf in the hierarchy of A_i or (ii) all leaves in
the subtree of an internal node in the hierarchy of A_i; this
is standard for OLAP-style navigation using roll-up or drill-down.
For example, given the hierarchy in Figure 1, examples
of S_i are {USA}, {Canada}, and the set of all countries
in North America. Range-count queries are essential for
various analytical tasks, e.g., OLAP, association rule mining 
and decision tree construction over a data cube. 

B. Previous Approaches 

As demonstrated in Section I, the information in T can be
represented by a d-dimensional frequency matrix M with m
entries, such that (i) the i-th (i ∈ [1, d]) dimension of M is
indexed by the values of A_i, and (ii) the entry in M with a
coordinate vector (x_1, x_2, ..., x_d) stores the number of tuples
t in T such that t = (x_1, x_2, ..., x_d). (This is the lowest level
of the data cube of T.) Observe that any range-count query 
on T can be answered using M, by summing up the entries 
in M whose coordinates satisfy all query predicates. 
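
As an illustration only, the mapping from T to M and the evaluation of a range-count query on M can be sketched in Python as follows. The toy table, the attribute values, and the helper names frequency_matrix and range_count are assumptions made for this sketch; they do not come from the paper's tables.

```python
from collections import Counter
from itertools import product

def frequency_matrix(tuples):
    """Map each coordinate vector to the number of tuples that equal it."""
    return Counter(tuples)

def range_count(freq, predicates):
    """Sum the entries whose coordinates satisfy every per-attribute predicate.

    `predicates` holds one collection of admissible values per attribute."""
    return sum(freq[coords] for coords in product(*predicates))

# Hypothetical toy table with attributes (age group, has diabetes).
table = [("<30", "yes"), ("<30", "no"), ("30-39", "no"),
         ("40-49", "yes"), ("40-49", "yes"), ("50-59", "no")]
M = frequency_matrix(table)

# Number of diabetes patients with age under 50.
under_50_diabetes = range_count(M, [["<30", "30-39", "40-49"], ["yes"]])
```

Storing M sparsely in a Counter is a convenience of the sketch; a dense d-dimensional array would serve equally well.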

Dwork et al. [6] prove that M can be released in a privacy 
preserving manner by adding a small amount of noise to each 
entry in M independently. Specifically, if the noise η follows
a Laplace distribution with a probability density function

\Pr\{\eta = x\} = \frac{1}{2\lambda} e^{-|x|/\lambda},    (1)

then the noisy frequency matrix ensures (2/λ)-differential
privacy. We refer to λ as the magnitude of the noise. Note
that a Laplace noise with magnitude λ has a variance 2λ^2.
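
A minimal Python sketch of this noise injection is given below, using only the standard library. It assumes that a Laplace variable with magnitude λ can be drawn as the difference of two independent exponential variables with mean λ (a standard identity); the function names, the seed, and the toy entries are illustrative assumptions.

```python
import random

def laplace_noise(magnitude, rng):
    """Draw one zero-mean Laplace sample with the given magnitude (scale).

    The difference of two i.i.d. exponentials with mean `magnitude` is Laplace."""
    return rng.expovariate(1.0 / magnitude) - rng.expovariate(1.0 / magnitude)

def noisy_matrix(entries, lam, rng):
    """Add independent Laplace noise of magnitude `lam` to every entry."""
    return [v + laplace_noise(lam, rng) for v in entries]

rng = random.Random(0)   # fixed seed only to make the sketch repeatable
lam = 2.0                # magnitude 2 corresponds to (2/lam) = 1-differential privacy
noisy = noisy_matrix([3, 0, 2, 1], lam, rng)

# Empirical check of the stated variance 2 * lam**2 (here 8).
samples = [laplace_noise(lam, rng) for _ in range(100_000)]
mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)
```

The empirical variance should land close to 2λ^2, which is the quantity the utility analysis below relies on.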

Privacy Analysis. To explain why Dwork et al.'s method 
ensures privacy, suppose that we arbitrarily modify a tuple 
in T. In that case, the frequency matrix of T will change 
in exactly two entries, each of which will be decreased or 
increased by one. For example, assume that we modify the 
first tuple in Table I by setting its age value to "30-39". Then,
in the frequency matrix in Table II, the first (second) entry
of the second column will be decreased (increased) by one. 
Intuitively, such small changes in the entries can be easily 
offset by the noise added to the frequency matrix. In other 
words, the noisy matrix is insensitive to any modification to 
a single tuple in T. Thus, it is difficult for an adversary to 
infer private information from the noisy matrix. More formally, 
Dwork et al.'s method is based on the concept of sensitivity. 

Definition 2 (Sensitivity [6]): Let F be a set of functions,
such that the output of each function f ∈ F is a real number.
The sensitivity of F is defined as

S(F) = \max_{T_1, T_2} \sum_{f \in F} |f(T_1) - f(T_2)|,    (2)

where T_1 and T_2 are any two tables that differ in only one
tuple. ■

Note that the frequency matrix M of T can be regarded 
as the outputs of a set of functions, such that each function 
maps T to an entry in M. Modifying any tuple in T will only 
change the values of two entries (in M) by one. Therefore, 
the set of functions corresponding to M has a sensitivity of 
2. The following theorem shows a sufficient condition for
ε-differential privacy.

Theorem 1 ([6]): Let F be a set of functions with a
sensitivity S(F). Let Q be an algorithm that adds independent
noise to the output of each function in F, such that the noise
follows a Laplace distribution with magnitude λ. Then, Q
satisfies (S(F)/λ)-differential privacy. ■

By Theorem 1, Dwork et al.'s method guarantees
(2/λ)-differential privacy, since M corresponds to a set of queries
on T with a sensitivity of 2.

Utility Analysis. Suppose that we answer a range-count query 
using a noisy frequency matrix M* generated by Dwork et 
al.'s method. The noise in the query result has a variance
Θ(m/ε^2) in the worst case. This is because (i) each entry in
M* has a noise variance 8/ε^2 (by Equation 1 and ε = 2/λ),
and (ii) a range-count query may cover up to m entries in M*.
Therefore, although Dwork et al.'s method provides reasonable 
accuracy for queries that involve a small number of entries in 
M*, it offers unsatisfactory utility for large queries that cover 
many entries in M*. 



III. The Privelet Framework 

This section presents an overview of our Privelet technique. 
We first clarify the key steps of Privelet in Section III-A,
and then provide in Section III-B a sufficient condition for
achieving ε-differential privacy with Privelet.

A. Overview of Privelet 

Our Privelet technique takes as input a relational table T 
and a parameter λ, and outputs a noisy version M* of the
frequency matrix M of T. At a high level, Privelet works in 
three steps as follows. 

First, it applies a wavelet transform on M. Generally 
speaking, a wavelet transform is an invertible linear function, 
i.e., it maps M to another matrix C, such that (i) each entry 
in C is a linear combination of the entries in M, and (ii) 
M can be losslessly reconstructed from C. The entries in C 
are referred to as the wavelet coefficients. Note that wavelet 
transforms are traditionally only defined for ordinal data, and 
we create a special extension for nominal data in our setting. 

Second, Privelet adds an independent Laplace noise to each 
wavelet coefficient in a way that ensures ε-differential privacy.
This results in a new matrix C* with noisy coefficients. In the
third step, Privelet (optionally) refines C*, and then maps C*
back to a noisy frequency matrix M*, which is returned as the
output. The refinement of C* may arbitrarily modify C*, but it
does not utilize any information from T or M. In other words,
the third step of Privelet depends only on C*. This ensures
that Privelet does not leak any information of T, except for
what has been disclosed in C*. Our solution in Section V
incorporates a refinement procedure to achieve better utility 
for range-count queries. 

B. Privacy Condition 

The privacy guarantee of Privelet relies on its second step, 
where it injects Laplace noise into the wavelet coefficient 
matrix C. To understand why this achieves ε-differential
privacy, recall that, even if we arbitrarily replace one tuple 
in the input data, only two entries in the frequency matrix 
M will be altered. In addition, each of those two entries will 
be offset by exactly one. This will incur only linear changes 
in the wavelet coefficients in C, since each coefficient is a 
linear combination of the entries in M. Intuitively, such linear 
changes can be concealed, as long as an appropriate amount 
of noise is added to C. 

In general, the noise required for each wavelet coefficient 
varies, as each coefficient reacts differently to changes in M. 
Privelet decides the amount of noise for each coefficient based 
on a weight function W, which maps each coefficient to a 
positive real number. In particular, the magnitude of the noise 
for a coefficient c is always set to λ/W(c), i.e., a larger weight
leads to a smaller noise. To analyze the privacy implication 
of such a noise injection scheme, we introduce the concept of 
generalized sensitivity. 

Definition 3 (Generalized Sensitivity): Let F be a set of
functions, each of which takes as input a matrix and outputs
a real number. Let W be a function that assigns a weight to
each function f ∈ F. The generalized sensitivity of F with
respect to W is defined as the smallest number ρ such that

\sum_{f \in F} \big( W(f) \cdot |f(M) - f(M')| \big) \le \rho \cdot \|M - M'\|_1,

where M and M' are any two matrices that differ in only one
entry, and \|M - M'\|_1 = \sum_{v \in M - M'} |v| is the L_1 distance
between M and M'. ■

Observe that each wavelet coefficient c can be regarded as 
the output of a function / that maps the frequency matrix M to 
a real number. Thus, the wavelet transform can be regarded as 
the set of functions corresponding to the wavelet coefficients. 
The weight W(c) we assign to each coefficient c can be 
thought of as a weight given to the function associated with c. 
Intuitively, the generalized sensitivity captures the "weighted" 
sensitivity of the wavelet coefficients with respect to changes 
in M. The following lemma establishes the connection between
generalized sensitivity and ε-differential privacy.

Lemma 1: Let F be a set of functions that has a generalized
sensitivity ρ with respect to a weight function W. Let Q be a
randomized algorithm that takes as input a table T and outputs
a set {f(M) + η(f) | f ∈ F} of real numbers, where M is
the frequency matrix of T, and η(f) is a random variable that
follows a Laplace distribution with magnitude λ/W(f). Then,
Q satisfies (2ρ/λ)-differential privacy.

Proof: See Appendix A. ■

By Lemma 1, if a wavelet transform has a generalized
sensitivity ρ with respect to weight function W, then we
can achieve ε-differential privacy by adding to each wavelet
coefficient c a Laplace noise with magnitude 2ρ/(ε · W(c)). This
justifies the noise injection scheme of Privelet.

IV. Privelet for One-Dimensional Ordinal Data 

This section instantiates the Privelet framework with the
one-dimensional Haar wavelet transform [7] (HWT), a popular
technique for processing one-dimensional ordinal data. The
one-dimensional HWT requires the input to be a vector that
contains 2^l (l ∈ ℕ) totally ordered elements. Accordingly, we
assume without loss of generality that (i) the frequency matrix M has a single
ordinal dimension, and (ii) the number m of entries in M
equals 2^l (this can be ensured by inserting dummy values into
M [7]). We first explain the HWT in Section IV-A, and then
present the instantiation of Privelet in Section IV-B.

A. One-Dimensional Haar Wavelet Transform 

The HWT converts M into 2^l wavelet coefficients as
follows. First, it constructs a full binary tree R with 2^l leaves,
such that the i-th leaf of R equals the i-th entry in M
(i ∈ [1, 2^l]). It then generates a wavelet coefficient c for each
internal node N in R, such that c = (a_1 - a_2)/2, where a_1 (a_2)
is the average value of the leaves in the left (right) subtree of
N. After all internal nodes in R are processed, an additional
coefficient (referred to as the base coefficient) is produced by
taking the mean of all leaves in R. For convenience, we refer to
R as the decomposition tree of M, and slightly abuse notation
by not distinguishing between an internal node in R and the
wavelet coefficient generated for the node.

Fig. 2. One-Dimensional Haar Wavelet Transform

Example 1: Figure 2 illustrates an HWT on a one-dimensional
frequency matrix M with 8 entries v_1, ..., v_8.
Each number in a circle (square) shows the value of a wavelet
coefficient (an entry in M). The base coefficient c_0 equals the
mean 5.5 of the entries in M. The coefficient c_1 has a value
-0.5, because (i) the average value of the leaves in its left
(right) subtree equals 5 (6), and (ii) (5 - 6)/2 = -0.5. ■
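
The construction above can be sketched in Python as follows. This is an illustrative sketch rather than the authors' implementation: the function name haar_transform and the level-by-level output layout are assumptions, and the 8-entry vector is a hypothetical input chosen only because it reproduces the values quoted in Example 1 (mean 5.5, subtree averages 5 and 6).

```python
def haar_transform(values):
    """Return (base, levels) for a vector whose length is a power of two.

    levels[i - 1] lists the coefficients at level i of the decomposition
    tree (level 1 is the root), ordered left to right."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    averages = list(values)     # subtree averages at the current depth
    levels = []
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        levels.append([(a1 - a2) / 2 for a1, a2 in pairs])
        averages = [(a1 + a2) / 2 for a1, a2 in pairs]
    levels.reverse()            # computed bottom-up, reported root-first
    return averages[0], levels

# Hypothetical input consistent with the numbers stated in Example 1.
M = [9, 3, 6, 2, 8, 4, 5, 7]
base, levels = haar_transform(M)
```

Each pass halves the number of averages, so the whole transform touches O(m) values.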

Given the Haar wavelet coefficients of M, any entry v in
M can be easily reconstructed. Let c_0 be the base coefficient,
and c_i (i ∈ [1, l]) be the ancestor of v at level i of the
decomposition tree R (we regard the root of R as level 1).
We have

v = c_0 + \sum_{i=1}^{l} (g_i \cdot c_i),    (3)

where g_i equals 1 (-1) if v is in the left (right) subtree of c_i.

Example 2: In the decomposition tree in Figure 2, the leaf
v_2 has three ancestors c_1 = -0.5, c_2 = 1, and c_4 = 3. Note
that v_2 is in the right (left) subtree of c_4 (c_1 and c_2), and the
base coefficient c_0 equals 5.5. We have v_2 = 3 = c_0 + c_1 +
c_2 - c_4. ■
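
The reconstruction rule can be sketched in Python by walking the decomposition tree from the root, adding a coefficient when descending left and subtracting it when descending right. The function name and the root-first list layout are assumptions of this sketch. The coefficients c_0 = 5.5, c_1 = -0.5, c_2 = 1, and c_4 = 3 are those quoted in the examples; the remaining coefficient values are hypothetical fill-ins, since the figure itself is not reproduced here.

```python
def haar_reconstruct(base, levels):
    """Invert the HWT: each entry is c_0 plus the signed ancestors on its path."""
    values = [base]
    for coeffs in levels:                 # walk down from the root
        expanded = []
        for partial, c in zip(values, coeffs):
            expanded.append(partial + c)  # left subtree:  g = +1
            expanded.append(partial - c)  # right subtree: g = -1
        values = expanded
    return values

base = 5.5
levels = [[-0.5],                    # level 1: c_1
          [1.0, 0.0],                # level 2: c_2 and an assumed sibling
          [3.0, 2.0, 2.0, -1.0]]     # level 3: c_4 and assumed siblings
M = haar_reconstruct(base, levels)
v2 = M[1]                            # 5.5 + (-0.5) + 1 - 3
```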

B. Instantiation of Privelet 

Privelet with the one-dimensional HWT follows the three-step
paradigm introduced in Section III-A. Given a parameter
λ and a table T with a single ordinal attribute, Privelet
first computes the Haar wavelet coefficients of the frequency
matrix M of T. It then adds to each coefficient c a random
Laplace noise with magnitude λ/W_Haar(c), where W_Haar is
a weight function defined as follows: For the base coefficient
c, W_Haar(c) = m; for a coefficient c_i at level i of the decomposition
tree, W_Haar(c_i) = 2^{l-i+1}. For example, given the
wavelet coefficients in Figure 2, W_Haar would assign weights
8, 8, 4, 2 to c_0, c_1, c_2, and c_4, respectively. After the noisy
wavelet coefficients are computed, Privelet converts them back
to a noisy frequency matrix M* based on Equation 3, and then
terminates by returning M*.
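
The weight function and the noise injection step can be sketched as below. This is a sketch under stated assumptions, not the authors' code: coefficients are held as a base value plus a root-first list of levels, Laplace noise is drawn as a difference of two exponentials, and the names haar_weights and add_noise are hypothetical.

```python
import random

def haar_weights(m):
    """W_Haar: weight m for the base coefficient, 2**(l - i + 1) at level i."""
    l = m.bit_length() - 1              # m = 2**l
    return m, [2 ** (l - i + 1) for i in range(1, l + 1)]

def add_noise(base, levels, lam, rng):
    """Add Laplace noise of magnitude lam / W_Haar(c) to every coefficient c."""
    def laplace(scale):
        return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
    m = 2 ** len(levels)
    w_base, w_levels = haar_weights(m)
    noisy_base = base + laplace(lam / w_base)
    noisy_levels = [[c + laplace(lam / w) for c in coeffs]
                    for coeffs, w in zip(levels, w_levels)]
    return noisy_base, noisy_levels

rng = random.Random(7)                  # seed chosen arbitrarily for repeatability
noisy_base, noisy_levels = add_noise(
    5.5, [[-0.5], [1.0, 0.0], [3.0, 2.0, 2.0, -1.0]], lam=4.0, rng=rng)
```

Coefficients near the root receive the largest weights and therefore the smallest noise, which matches the observation that they influence the most entries of M*.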

This instantiation of Privelet with the one-dimensional 
HWT has the following property. 



Lemma 2: The one-dimensional HWT has a generalized
sensitivity of 1 + log_2 m with respect to the weight function
W_Haar.

Proof: Let C be the set of Haar wavelet coefficients of
the input matrix M. Observe that, if we increase or decrease
any entry v in M by a constant δ, only 1 + log_2 m coefficients
in C will be changed, namely, the base coefficient c_0 and all
ancestors of v in the decomposition tree R. In particular, c_0
will be offset by δ/m; for any other coefficient, if it is at
level i of the decomposition tree R, then it will change by
δ/2^{l-i+1}. Recall that W_Haar assigns a weight of m to c_0,
and a weight of 2^{l-i+1} to any coefficient at level i of R.
Thus, the generalized sensitivity of the one-dimensional Haar
wavelet transform with respect to W_Haar is

\Big( m \cdot \delta/m + \sum_{i=1}^{l} \big( 2^{l-i+1} \cdot \delta/2^{l-i+1} \big) \Big) / \delta = 1 + \log_2 m.

■
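
The claim of Lemma 2 can be checked numerically on a small input. The Python sketch below perturbs one entry by δ and sums the weighted absolute changes of all Haar coefficients. The 8-entry vector and the helper names are assumptions made for illustration; the check covers a single instance and does not replace the proof.

```python
def haar_coefficients(values):
    """Return [(weight, coefficient), ...] under W_Haar for a power-of-two vector."""
    m = len(values)
    out = [(m, sum(values) / m)]               # base coefficient, weight m
    averages, width = list(values), 1          # width = leaves per subtree
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        # a node whose two subtrees hold `width` leaves each has weight 2 * width
        out += [(2 * width, (a - b) / 2) for a, b in pairs]
        averages = [(a + b) / 2 for a, b in pairs]
        width *= 2
    return out

def weighted_change(before, after):
    """Weighted L1 difference between the coefficients of two vectors."""
    return sum(w * abs(x - y)
               for (w, x), (_, y) in zip(haar_coefficients(before),
                                         haar_coefficients(after)))

M = [9, 3, 6, 2, 8, 4, 5, 7]                   # hypothetical input, m = 8
delta = 1.0
M_changed = list(M)
M_changed[1] += delta                          # offset one entry by delta
rho = weighted_change(M, M_changed) / delta    # expected: 1 + log2(8)
```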

By Lemmas 1 and 2, Privelet with the one-dimensional
HWT ensures ε-differential privacy with ε = 2(1 + log_2 m)/λ,
where λ is the input parameter. On the other hand, Privelet
also provides a strong utility guarantee for range-count queries,
as shown in the following lemma.

Lemma 3: Let C be a set of one-dimensional Haar wavelet
coefficients such that each coefficient c ∈ C is injected
independent noise with a variance at most (σ/W_Haar(c))^2.
Let M* be the noisy frequency matrix reconstructed from C.
For any range-count query answered using M*, the variance
of noise in the answer is at most (2 + log_2 |M*|)/2 · σ^2.

Proof: See Appendix B. ■

By Lemmas 2 and 3, Privelet achieves ε-differential privacy
while ensuring that the result of any range-count query has a
noise variance bounded by

(2 + \log_2 m) \cdot (2 + 2\log_2 m)^2 / \epsilon^2 = O\big((\log_2 m)^3 / \epsilon^2\big).    (4)

In contrast, as discussed in Section II-B, with the same privacy
requirement, Dwork et al.'s method incurs a noise variance of
Θ(m/ε^2) in the query answers.
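
To give a feel for the gap between the two bounds, the following Python sketch evaluates Equation 4 against the worst-case variance 8m/ε^2 of the baseline (a query covering all m entries, each with variance 8/ε^2). The function names and the choice m = 2^20 are assumptions made for illustration.

```python
import math

def privelet_hwt_bound(m, epsilon):
    """Left-hand side of Equation 4."""
    log_m = math.log2(m)
    return (2 + log_m) * (2 + 2 * log_m) ** 2 / epsilon ** 2

def baseline_worst_case(m, epsilon):
    """Variance when a query sums all m entries, each with variance 8 / epsilon**2."""
    return 8 * m / epsilon ** 2

m, epsilon = 2 ** 20, 1.0
privelet = privelet_hwt_bound(m, epsilon)     # 22 * 42**2
baseline = baseline_worst_case(m, epsilon)    # 8 * 2**20
```

The comparison is asymptotic: for a very small domain such as m = 8, Equation 4 evaluates to 320/ε^2 while 8m/ε^2 is only 64/ε^2, so the advantage appears once m is large.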

Before closing this section, we point out that Privelet with
the one-dimensional HWT has an O(n + m) time complexity
for construction. This follows from the facts that (i) mapping
T to M takes O(m + n) time, (ii) converting M to and from
the Haar wavelet coefficients incurs O(m) overhead [7], and
(iii) adding Laplace noise to the coefficients takes O(m) time.

V. Privelet for One-dimensional Nominal Data 

This section extends Privelet for one-dimensional nominal
data by adopting a novel nominal wavelet transform.
Section V-A introduces the new transform, and Section V-B
explains the noise injection scheme for nominal wavelet coefficients.
Section V-C analyzes the privacy and utility guarantees
of the algorithm and its time complexity. Section V-D compares
the algorithm with an alternative solution that employs
the HWT.

Fig. 3. A nominal wavelet transform

A. Nominal Wavelet Transform 

Existing wavelet transforms are designed only for ordinal
data, i.e., they require each dimension of the input matrix
to have a totally ordered domain. Hence, they are
not directly applicable on nominal data, since the values of 
a nominal attribute A are not totally ordered. One way to 
circumvent this issue is to impose an artificial total order on 
the domain of A, such that for any internal node N in the 
hierarchy of A, the set of leaves in the subtree of N constitutes 
a contiguous sequence in the total order. 

For example, given a nominal attribute A with the hierarchy
H in Figure 3, we impose on A a total order v_1 < v_2 <
... < v_6. As such, A is transformed into an ordinal attribute
A'. Recall that for a nominal attribute, the range-count query
predicate "A ∈ S" has a special structure: S either contains (i)
a leaf in the hierarchy of A or (ii) all leaves in the subtree of
an internal node in the hierarchy of A. Therefore, S is always
a contiguous range in the imposed total order of A. With this
transformation, we can apply Privelet with the HWT on any
one-dimensional nominal data. The noise variance bound thus
obtained is O((log_2 m)^3/ε^2) (see Equation 4).

While using the HWT is one possible solution, Privelet
does not stop here. We will show how to improve the above
O((log_2 m)^3/ε^2) bound to O(h^2/ε^2), where h is the height of
the hierarchy H on the nominal data. (Note that h ≤ log_2 m
holds for any hierarchy where each internal node has at least
two children.) This improvement can result in a reduction of
noise variance by an order of magnitude or more in practice,
as we will discuss in Section V-D. The core of our solution is a
novel wavelet transform that creates a different decomposition
tree for generating wavelet coefficients.

A first thought for a different decomposition tree might be
to use the hierarchy H, i.e., to generate wavelet coefficients
from each internal node N in H. Intuitively, if N has only
two children, then we may produce a coefficient c from N
as in the HWT, i.e., we first compute the average value a_1
(a_2) of the leaves in the left (right) subtree of N, and then set
c = (a_1 - a_2)/2. But what if N has k (k > 2) children? Should
we generate one coefficient from each pair of subtrees of N?
But that will result in \binom{k}{2} coefficients, which is undesirable
when k is large. Is it possible to generate coefficients without
relying on pairwise comparison of subtrees? We answer
this question positively with the introduction of the nominal
wavelet transform.

Given a one-dimensional frequency matrix M and a hier- 
archy H on the entries in M, the nominal wavelet transform 



first constructs a decomposition tree R from H by attaching
a child node N_c to each leaf node N in H. The value of
N_c is set to the value of the entry in M that corresponds to
N. For example, given the hierarchy H on the left-hand side
of Figure 3, the decomposition tree R constructed from H
is shown on the right-hand side of the figure. In the second step, the
nominal wavelet transform computes a wavelet coefficient for 
each internal node of R as follows. The coefficient for the root 
node (referred to as the base coefficient) is set to the sum of 
all leaves in its subtree, also called the leaf-sum of the node. 
For any other internal node, its coefficient equals its leaf-sum 
minus the average leaf-sum of its parent's children. 

Given these nominal wavelet coefficients of M, each entry
v in M can be reconstructed using the ancestors of v in the
decomposition tree R. In particular, let c_i be the ancestor of
v in the (i + 1)-th level of R, and f_i be the fanout of c_i; we
have

v = c_{h-1} + \sum_{i=0}^{h-2} \Big( c_i \cdot \prod_{j=i}^{h-2} \frac{1}{f_j} \Big),    (5)

where h is the height of the hierarchy H on M. To understand
Equation 5, recall that c_0 equals the leaf-sum of the root in
R, while c_k (k ∈ [1, h - 1]) equals the leaf-sum of c_k minus
the average leaf-sum of c_{k-1}'s children. Thus, the leaf-sum
of c_1 equals c_1 + c_0/f_0, the leaf-sum of c_2 equals c_2 + (c_1 +
c_0/f_0)/f_1, and so on. It can be verified that the leaf-sum of
c_{h-1} equals exactly the right hand side of Equation 5. Since
v is the only leaf of c_{h-1} in R, Equation 5 holds.

Example 3: Figure 3 illustrates a one-dimensional frequency
matrix M, a hierarchy H associated with M, and a
nominal wavelet transform on M. The base coefficient c_0 = 30
equals the sum of all leaves in the decomposition tree. The
coefficient c_1 equals 3, because (i) it has a leaf-sum 18, (ii)
the average leaf-sum of its parent's children equals 15, and
(iii) 18 - 15 = 3.

In the decomposition tree in Figure 3, the entry v_1 has three
ancestors, namely, c_0, c_1, and c_3, which are at levels 1, 2, and
3 of the decomposition tree, respectively. Furthermore, the fanouts
of c_0 and c_1 equal 2 and 3, respectively. We have v_1 = 9 =
c_3 + c_0/2/3 + c_1/3. ■
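
The nominal wavelet transform and its inverse can be sketched in Python as follows. The representation is an assumption of this sketch: the hierarchy is a nested list whose leaves are integer indices into M, and coefficients are returned as (coefficient, children) pairs mirroring the hierarchy. The six entries of M are hypothetical values chosen to agree with the numbers quoted in Example 3 (total 30, leaf-sum 18 under c_1, v_1 = 9).

```python
def leaf_sum(node, M):
    """Sum of the entries of M under a hierarchy node (int index or list of children)."""
    if isinstance(node, int):
        return M[node]
    return sum(leaf_sum(child, M) for child in node)

def nominal_transform(node, M, sibling_avg=None):
    """Coefficient tree: the root holds its leaf-sum, every other node holds
    its leaf-sum minus the average leaf-sum of its parent's children."""
    total = leaf_sum(node, M)
    coeff = total if sibling_avg is None else total - sibling_avg
    if isinstance(node, int):
        return (coeff, [])
    avg = total / len(node)
    return (coeff, [nominal_transform(child, M, avg) for child in node])

def nominal_reconstruct(tree, inherited=0.0, out=None):
    """Rebuild the entries left to right; a node's leaf-sum is coeff + inherited."""
    out = [] if out is None else out
    coeff, children = tree
    total = coeff + inherited
    if not children:
        out.append(total)
    for child in children:
        nominal_reconstruct(child, total / len(children), out)
    return out

H = [[0, 1, 2], [3, 4, 5]]     # root with two internal nodes of three leaves each
M = [9, 3, 6, 2, 8, 2]         # hypothetical entries consistent with Example 3
C = nominal_transform(H, M)
c0, c1, c3 = C[0], C[1][0][0], C[1][0][1][0][0]
restored = nominal_reconstruct(C)
```

Recomputing leaf-sums recursively keeps the sketch short at the cost of repeated work; a single bottom-up pass would avoid it.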

Note that our novel nominal wavelet transform is over-complete:
The number m' of wavelet coefficients we generate
is larger than the number m of entries in the input frequency
matrix M. In particular, m' - m equals the number of internal
nodes in the hierarchy H on M. The overhead incurred by
such over-completeness, however, is usually negligible, as the
number of internal nodes in a practical hierarchy H is usually
small compared to the number of leaves in H.

B. Instantiation of Privelet 

We are now ready to instantiate Privelet for one-dimensional
nominal data. Given a parameter λ and a table T with a
single nominal attribute, we first apply the nominal wavelet
transform on the frequency matrix M of T. After that, we
inject into each nominal wavelet coefficient c a Laplace noise
with magnitude λ/W_Nom(c). Specifically, W_Nom(c) = 1 if
c is the base coefficient, otherwise W_Nom(c) = f/(2f - 2),
where f is the fanout of c's parent in the decomposition tree.

Before converting the wavelet coefficients back to a noisy 
frequency matrix, we refine the coefficients with a mean 
subtraction procedure. In particular, we first divide all but 
the base coefficients into disjoint sibling groups, such that 
each group is a maximal set of noisy coefficients that have 
the same parent in the decomposition tree. For example, the 
wavelet coefficients in Figure 3 can be divided into three
sibling groups: {c_1, c_2}, {c_3, c_4, c_5}, and {c_6, c_7, c_8}. After
that, for each sibling group, the coefficient mean is computed
and then subtracted from each coefficient in the group. Finally,
we reconstruct a noisy frequency matrix M* from the modified
wavelet coefficients (based on Equation 5), and return M* as
the output.
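
The mean subtraction procedure can be sketched in Python as below, assuming the noisy coefficients are stored as nested (coefficient, children) pairs that mirror the decomposition tree. The representation, the function name, and the noisy values are all hypothetical and serve only to illustrate the step.

```python
def mean_subtract(tree):
    """Recenter every sibling group so that its coefficients sum to zero.

    The base coefficient (the root of `tree`) is left unchanged."""
    coeff, children = tree
    if not children:
        return (coeff, [])
    mean = sum(c for c, _ in children) / len(children)
    return (coeff, [mean_subtract((c - mean, grandchildren))
                    for c, grandchildren in children])

# Hypothetical noisy coefficients for a tree shaped like the one in Figure 3.
noisy = (30.5, [(3.4, [(3.1, []), (-2.8, []), (0.3, [])]),
                (-2.6, [(-2.2, []), (4.1, []), (-1.6, [])])])
refined = mean_subtract(noisy)

# One sum per sibling group: {c_1, c_2}, {c_3, c_4, c_5}, {c_6, c_7, c_8}.
group_sums = ([sum(c for c, _ in refined[1])] +
              [sum(c for c, _ in kids) for _, kids in refined[1]])
```

The function reads only the noisy coefficients, in line with the requirement that the refinement must not consult T or M.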

The mean subtraction procedure is essential to the utility 
guarantee of Privelet that we will prove in Section V-C. The
intuition is that, after the mean subtraction procedure, all noisy 
coefficients in the same sibling group sum up to zero; as such, 
for any non-root node N in the decomposition tree, the noisy 
coefficient corresponding to N still equals the noisy leaf-sum 
of N minus the average leaf-sum of the children of N's parent; 
in turn, this ensures that the reconstruction of M* based on 
Equation [5] is meaningful. 

We emphasize that the mean subtraction procedure does not 
rely on any information in T or M; instead, it is performed 
based only on the noisy wavelet coefficients. Therefore, the 
privacy guarantee of M* depends only on the noisy coefficients 
generated before the mean subtraction procedure, as 
discussed in Section III. 

C. Theoretical Analysis 

To prove the privacy guarantee of Privelet with the nominal 
wavelet transform, we first establish the generalized sensitivity 
of the nominal wavelet transform with respect to the weight 
function W_Nom used in the noise injection step. 

Lemma 4: The nominal wavelet transform has a generalized 
sensitivity of h with respect to W_Nom, where h is the height of 
the hierarchy associated with the input frequency matrix. 

Proof: Suppose that we offset an arbitrary entry v in 
the input frequency matrix M by a constant δ. Then, the 
base coefficient of M will change by δ. Meanwhile, for the 
coefficients at level i (i ∈ [2, h]) of the decomposition tree, 
only the sibling group G_i that contains an ancestor of v will 
be affected. In particular, the ancestor of v in G_i will be offset 
by δ − δ/|G_i|, while the other coefficients in G_i will change 
by δ/|G_i|. Recall that W_Nom assigns a weight 1 to the base 
coefficient, and a weight 1/(2 − 2/|G_i|) to all coefficients 
in G_i. Therefore, the generalized sensitivity of the nominal 
wavelet transform with respect to W_Nom should be 

$$1 + \sum_{i=2}^{h} \frac{1}{2 - 2/|G_i|} \cdot \left( \left(1 - \frac{1}{|G_i|}\right) + \frac{|G_i| - 1}{|G_i|} \right) = h.$$ ■ 
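
The sum in the proof of Lemma 4 can be checked numerically. The sketch below is only an illustration: the function name and the sample group sizes are hypothetical, and each group is assumed to have at least two members so that the weight 1/(2 − 2/|G_i|) is defined.

```python
def generalized_sensitivity(group_sizes):
    """Weighted change per unit offset, given |G_i| for levels i = 2..h."""
    total = 1.0  # base coefficient: weight 1, changes by the full offset
    for g in group_sizes:
        weight = 1.0 / (2.0 - 2.0 / g)
        # One ancestor changes by (1 - 1/g); the other g - 1 siblings by 1/g each.
        change = (1.0 - 1.0 / g) + (g - 1) * (1.0 / g)
        total += weight * change
    return total

# A hierarchy of height h contributes h - 1 sibling groups on the path to a leaf.
print(generalized_sensitivity([2, 3]))
print(generalized_sensitivity([4, 7, 5]))
```

Each level contributes exactly 1 regardless of the group size, so the result depends only on the height of the hierarchy.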



By Lemmas 1 and 4, given a one-dimensional nominal 
table T and a parameter λ, Privelet with the nominal wavelet 
transform ensures (2h/λ)-differential privacy, where h is the 
height of the hierarchy associated with T. 

Lemma 5: Let C' be a set of nominal wavelet coefficients 
such that each c' ∈ C' contains independent noise with a 
variance at most (σ/W_Nom(c'))². Let C* be a set of wavelet 
coefficients obtained by applying a mean subtraction procedure 
on C', and M* be the noisy frequency matrix reconstructed 
from C*. For any range-count query answered using M*, the 
variance of the noise in the answer is less than 4σ². 

Proof: See Appendix C. ■ 

By Lemmas 4 and 5, when achieving ε-differential privacy, 
Privelet with the nominal wavelet transform guarantees that 
each range-count query result has a noise variance at most 

$$4 \cdot 2 \cdot (2h)^2 / \epsilon^2 = O(h^2/\epsilon^2). \qquad (6)$$

As h ≤ log₂ m holds in practice, the above O(h²/ε²) bound 
significantly improves upon the O(m/ε²) bound given by 
previous work. 

Privelet with the nominal wavelet transform runs in O(n + m) 
time. In particular, computing M from T takes O(n) 
time; the nominal wavelet transform on M has an O(m) 
complexity. The noise injection step incurs O(m) overhead. 
Finally, with a breadth-first traversal of the decomposition tree 
R, we can complete both the mean subtraction procedure 
and the reconstruction of the noisy frequency matrix. Such 
a breadth-first traversal takes O(m) time under the realistic 
assumption that the number of internal nodes in R is O(m). 

D. Nominal Wavelet Transform vs. Haar Wavelet Transform 

As discussed in Section V-A, Privelet with the HWT can 
provide an O((log₂ m)³/ε²) noise variance bound for one-dimensional 
nominal data by imposing a total order on the 
nominal domain. Asymptotically, this bound is inferior to the 
O(h²/ε²) bound in Equation 6, but how different are they in 
practice? To answer this question, let us consider the nominal 
attribute Occupation in the Brazil census dataset used in our 
experiments (see Section VII for details). It has a domain with 
m = 512 leaves and a hierarchy with 3 levels. Suppose that we 
apply Privelet with the one-dimensional HWT on a dataset that 
contains Occupation as the only attribute. Then, by Equation 4, 
we can achieve a noise variance bound of 

$$(2 + \log_2 m) \cdot (2 + 2\log_2 m)^2 / \epsilon^2 = 4400/\epsilon^2.$$

In contrast, if we use Privelet with the nominal wavelet 
transform, the resulting noise variance is bounded by 

$$4 \cdot 2 \cdot (2h)^2 / \epsilon^2 = 288/\epsilon^2,$$

i.e., we can obtain a 15-fold reduction in noise variance. 
Due to the superiority of the nominal wavelet transform over 
the straightforward HWT, in the remainder of the paper we will 
always use the former for nominal attributes. 
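
The two bounds compared above are easy to reproduce. The following sketch simply evaluates the two expressions (both scaled by ε², so the returned numbers are the constants in front of 1/ε²); the function names are hypothetical.

```python
import math

def hwt_bound(m):
    # (2 + log2 m) * (2 + 2 log2 m)^2, the HWT bound of Equation 4 times eps^2
    l = math.log2(m)
    return (2 + l) * (2 + 2 * l) ** 2

def nominal_bound(h):
    # 4 * 2 * (2h)^2, the nominal wavelet bound of Equation 6 times eps^2
    return 4 * 2 * (2 * h) ** 2

# Occupation attribute of the Brazil dataset: m = 512 leaves, 3 levels.
ratio = hwt_bound(512) / nominal_bound(3)
print(hwt_bound(512), nominal_bound(3), ratio)
```

The ratio is slightly above 15, which is where the 15-fold reduction comes from.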

VI. Multi-Dimensional Privelet 

This section extends Privelet for multi-dimensional data. 
Section VI-A presents our multi-dimensional wavelet transform, 
which serves as the basis of the new instantiation of 
Privelet in Section VI-B. Section VI-C analyzes properties of 
the new instantiation, while Section VI-D further improves its 
utility guarantee. 

A. Multi-Dimensional Wavelet Transform 

The one-dimensional wavelet transforms can be extended 
to multi-dimensional data using standard decomposition [7], 
which works by applying the one-dimensional wavelet transforms 
along each dimension of the data in turn. More specifically, 
given a frequency matrix M with d dimensions, we first 
divide the entries in M into one-dimensional vectors, such 
that each vector contains a maximal set of entries that have 
identical coordinates on all but the first dimension. For each 
vector V, we convert it into a set S of wavelet coefficients 
using the one-dimensional Haar or nominal wavelet transform, 
depending on whether the first dimension of M is ordinal or 
nominal. After that, we store the coefficients in S in a vector 
V', where the coefficients are sorted based on a level-order 
traversal of the decomposition tree (the base coefficient always 
ranks first). The i-th (i ∈ [1, |S|]) coefficient in V' is assigned d 
coordinates (i, x_2, x_3, ..., x_d), where x_j is the j-th coordinate 
of the entries in V (j ∈ [2, d]; recall that the j-th coordinates of 
these entries are identical). After that, we organize all wavelet 
coefficients into a new d-dimensional matrix C_1 according to 
their coordinates. 

In the second step, we treat C_1 as the input data, and 
apply a one-dimensional wavelet transform along the second 
dimension of C_1 to produce a matrix C_2, in a manner similar 
to the transformation from M to C_1. In general, the matrix C_i 
generated in the i-th step will be used as the input to the (i+1)-th 
step. In turn, the (i+1)-th step will apply a one-dimensional 
wavelet transform along the (i+1)-th dimension of C_i, and 
will generate a new matrix C_{i+1}. We refer to C_i as the step-i 
matrix. After all d dimensions are processed, we stop and 
return C_d as the result. We refer to the transformation from M 
to C_d as a Haar-nominal (HN) wavelet transform. Observe 
that C_d can be easily converted back to the original matrix 
M, by applying inverse wavelet transforms along dimensions 
d, d − 1, ..., 1 in turn. 
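
The standard decomposition described above can be sketched for a two-dimensional matrix whose dimensions are both ordinal. This is a simplified illustration under several assumptions: the function names are hypothetical, each dimension has a power-of-two length, and the one-dimensional Haar transform uses the convention that a coefficient is half the difference between the averages of its left and right subtrees, with the overall mean as the base coefficient.

```python
def haar_1d(v):
    """Return Haar coefficients in level order, base coefficient first."""
    averages, levels = list(v), []
    while len(averages) > 1:
        pairs = [(averages[i], averages[i + 1]) for i in range(0, len(averages), 2)]
        levels.append([(a - b) / 2 for a, b in pairs])
        averages = [(a + b) / 2 for a, b in pairs]
    coeffs = [averages[0]]
    for level in reversed(levels):  # coarsest level first
        coeffs.extend(level)
    return coeffs

def inverse_haar_1d(coeffs):
    values, pos = [coeffs[0]], 1
    while pos < len(coeffs):
        details = coeffs[pos:pos + len(values)]
        pos += len(details)
        values = [x for avg, d in zip(values, details) for x in (avg + d, avg - d)]
    return values

def along_dim1(matrix, f):
    # Vectors whose entries agree on every coordinate except the first (columns).
    cols = [f(list(col)) for col in zip(*matrix)]
    return [list(row) for row in zip(*cols)]

def along_dim2(matrix, f):
    return [f(list(row)) for row in matrix]

def hn_transform_2d(m):
    return along_dim2(along_dim1(m, haar_1d), haar_1d)

def hn_inverse_2d(c):
    # Invert the dimensions in reverse order: d, d - 1, ..., 1.
    return along_dim1(along_dim2(c, inverse_haar_1d), inverse_haar_1d)

M = [[1, 2, 3, 4], [5, 6, 7, 8]]
C2 = hn_transform_2d(M)
restored = hn_inverse_2d(C2)
```

A nominal dimension would use the nominal wavelet transform in place of `haar_1d` for that step; the rest of the structure stays the same.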

Example 4: Figure 4 illustrates an HN wavelet transform 
on a matrix M with two ordinal dimensions. In the first step 
of the transform, M is vertically divided into two vectors 
(v_11, v_12) and (v_21, v_22). These two vectors are then converted 
into two new vectors (v'_11, v'_12) and (v'_21, v'_22) using the 
one-dimensional HWT. Note that v'_11 and v'_21 are the base 
coefficients. The new matrix C_1 is the step-1 matrix. 

Next, C_1 is horizontally partitioned into two vectors 
(v'_11, v'_21) and (v'_12, v'_22). We apply the HWT on them, and 
generate two coefficient vectors (c_11, c_21) and (c_12, c_22), with 
c_11 and c_12 being the base coefficients. The matrix C_2 is 
returned as the final result. ■ 

B. Instantiation of Privelet 

Given a d-dimensional table T and a parameter λ, Privelet 
first performs the HN wavelet transform on the frequency 
matrix M of T. Then, it adds a Laplace noise with magnitude 
λ/W_HN(c) to each coefficient c, where W_HN is a weight 
function that we will define shortly. Next, it reconstructs a 
noisy frequency matrix M* using the noisy wavelet coefficients 
by inverting the one-dimensional wavelet transforms on 
dimensions d, d − 1, ..., 1 in turn.² Finally, it terminates by 
returning M*. 

The weight function W_HN is decided by the one-dimensional 
wavelet transforms adopted in the HN wavelet 
transform. Let W_i be the weight function associated with 
the transform used to compute the step-i (i ∈ [1, d]) matrix, 
i.e., W_i = W_Haar if the i-th dimension of M is ordinal, 
otherwise W_i = W_Nom. We determine the weight W_HN(c) 
for each HN wavelet coefficient c as follows. First, during the 
construction of the step-1 matrix C_1, whenever we generate a 
coefficient vector V', we assign to each c' ∈ V' a weight 
W_1(c'). For instance, if the first dimension A_1 of M is 
nominal, then W_1(c') = 1 if c' is the base coefficient, otherwise 
W_1(c') = f/(2f − 2), where f is the fanout of the parent of 
c' in the decomposition tree. Due to the way we arrange the 
coefficients in C_1, if two coefficients in C_1 have the same 
coordinates on the first dimension, they must have identical 
weights. 

Now consider the second step of the HN wavelet transform. 
In this step, we first partition C_1 into vectors along the 
second dimension, and then apply one-dimensional wavelet 
transforms to convert each vector V'' into a new coefficient 
vector V*. Observe that all coefficients in V'' should have the 
same weight, since they have identical coordinates on the first 
dimension. We set the weight of each c* ∈ V* to be W_2(c*) 
times the weight shared by the coefficients in V''. 

In general, in the i-th step of the HN wavelet transform, 
whenever we generate a coefficient c from a vector V ⊂ C_{i−1}, 
we always set the weight of c to the product of W_i(c) and the 
weight shared by the coefficients in V. All coefficients in V 
are guaranteed to have the same weight, because of the way 
we arrange the entries in C_{i−1}. The weight function W_HN for 
the HN wavelet transform is defined as a function that maps 
each coefficient in C_d to its weight computed as above. For 
convenience, for each coefficient c ∈ C_i (i ∈ [1, d − 1]), we 
also use W_HN(c) to denote the weight of c in C_i. 

² If the i-th dimension is nominal, then, whenever we convert a vector V' 
in the step-i matrix back to a vector V in the step-(i − 1) matrix, we will 
apply the mean subtraction procedure before the reconstruction of V. 



Example 5: Consider the HN wavelet transform in Figure 4. 
Both dimensions of the frequency matrix M are ordinal, and 
hence, the weight function for both dimensions is W_Haar. In 
the step-1 matrix C_1, the weights of the coefficients v'_11 and 
v'_21 equal 1/2, because (i) they are the base coefficients in the 
wavelet transforms on (v_11, v_12) and (v_21, v_22), respectively, 
and (ii) W_Haar assigns a weight 1/2 to the base coefficient 
whenever the input vector contains only two entries. 

Now consider the coefficient c_11 in the step-2 matrix C_2. 
It is generated from the HWT on (v'_11, v'_21), where both v'_11 
and v'_21 have a weight 1/2. In addition, as c_11 is the base 
coefficient, W_Haar(c_11) = 1/2. Consequently, W_HN(c_11) = 
1/2 · W_Haar(c_11) = 1/4. ■ 
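
Because the weight assigned in each step depends only on a coefficient's coordinate along that step's dimension, W_HN can be viewed as a product of per-dimension weights. The sketch below illustrates this under stated assumptions: the function name is hypothetical, coordinates are 0-based, and the per-dimension weight lists are placeholders in which only the base weight 1/2 is taken from Example 5 (the non-base value 1.0 is made up for illustration).

```python
from itertools import product
from math import prod

def hn_weights(per_dim_weights):
    """Map every coordinate tuple to the product of its per-dimension weights.

    per_dim_weights[k][i] is the weight W_k gives to the i-th coefficient
    (level order, base coefficient first) along dimension k.
    """
    shape = [len(w) for w in per_dim_weights]
    return {
        coords: prod(w[i] for w, i in zip(per_dim_weights, coords))
        for coords in product(*(range(n) for n in shape))
    }

# Two dimensions with two coefficients each; index 0 is the base coefficient.
weights = hn_weights([[0.5, 1.0], [0.5, 1.0]])
print(weights[(0, 0)])  # weight of the coefficient called c_11 in Example 5
```

Computing the weights this way avoids tracking them through the intermediate step-i matrices.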

C. Theoretical Analysis 

As Privelet with the HN wavelet transform is essentially a 
composition of the solutions in Sections IV-B and V-B, we can 
prove its privacy (utility) guarantee by incorporating Lemmas 
2 and 4 (3 and 5) with an induction argument on the dataset 
dimensionality d. Let us define a function P that takes as input 
any attribute A, such that 

$$\mathcal{P}(A) = \begin{cases} 1 + \log_2 |A| & \text{if } A \text{ is ordinal} \\ \text{the height } h \text{ of } A\text{'s hierarchy} & \text{otherwise} \end{cases}$$

Similarly, let H be a function such that 

$$\mathcal{H}(A) = \begin{cases} (2 + \log_2 |A|)/2 & \text{if } A \text{ is ordinal} \\ 4 & \text{otherwise} \end{cases}$$

We have the following theorems that show (i) the generalized 
sensitivity of the HN wavelet transform (Theorem 2), and (ii) 
the noise variance bound provided by Privelet with the HN 
wavelet transform (Theorem 3). 

Theorem 2: The HN wavelet transform on a d-dimensional 
matrix M has a generalized sensitivity $\prod_{i=1}^{d} \mathcal{P}(A_i)$ with 
respect to W_HN, where A_i is the i-th dimension of M. 

Proof: See Appendix D. ■ 

Theorem 3: Let C*_d be a d-dimensional HN wavelet coefficient 
matrix, such that each coefficient c* ∈ C*_d has a noise 
with a variance at most (σ/W_HN(c*))². Let M* be the noisy 
frequency matrix reconstructed from C*_d, and A_i (i ∈ [1, d]) be 
the i-th dimension of M*. For any range-count query answered 
using M*, the noise in the query result has a variance at most 

$$\sigma^2 \cdot \prod_{i=1}^{d} \mathcal{H}(A_i).$$

Proof: See Appendix E. ■ 

By Theorem 2, to achieve ε-differential privacy, Privelet 
with the HN wavelet transform should be applied with 
λ = 2/ε · ∏_{i=1}^{d} P(A_i); in that case, by Theorem 3, Privelet ensures 
that any range-count query result has a noise variance of at 
most 

$$2 \cdot \left( \frac{2}{\epsilon} \cdot \prod_{i=1}^{d} \mathcal{P}(A_i) \right)^2 \cdot \prod_{i=1}^{d} \mathcal{H}(A_i) = O\left(\log^{O(1)} m / \epsilon^2\right),$$

since P(A_i) and H(A_i) are logarithmic in m. 
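
The choice of λ and the resulting bound can be evaluated directly from the definitions of P and H. The following is a minimal sketch, not part of the original method description: the attribute encoding (a tuple of domain size, an ordinal flag, and a hierarchy height) and all function names are assumptions made for illustration.

```python
import math

def P(size, ordinal, height=None):
    # 1 + log2 |A| for ordinal attributes, hierarchy height h otherwise
    return 1 + math.log2(size) if ordinal else height

def H(size, ordinal):
    # (2 + log2 |A|) / 2 for ordinal attributes, 4 otherwise
    return (2 + math.log2(size)) / 2 if ordinal else 4

def privelet_lambda(attrs, eps):
    # lambda = 2/eps * prod_i P(A_i)
    return 2 / eps * math.prod(P(s, o, h) for s, o, h in attrs)

def variance_bound(attrs, eps):
    # A Laplace noise with magnitude lambda has variance 2 * lambda^2.
    lam = privelet_lambda(attrs, eps)
    return 2 * lam ** 2 * math.prod(H(s, o) for s, o, _ in attrs)

ordinal_16 = [(16, True, None)]   # one ordinal attribute with |A| = 16
nominal_h3 = [(512, False, 3)]    # one nominal attribute with a 3-level hierarchy
print(variance_bound(ordinal_16, eps=1.0), variance_bound(nominal_h3, eps=1.0))
```

With ε = 1 the nominal case reproduces the constant 288 of the one-dimensional analysis, which is a useful sanity check on the multi-dimensional formula.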

Privelet with the HN wavelet transform has an O(n + m) 
time complexity. This is because (i) computing the frequency 
matrix M takes O(n + m) time, (ii) each one-dimensional 
wavelet transform on M has O(m) complexity, and (iii) 
adding Laplace noise to the wavelet coefficients incurs O(m) 
overhead. 

Algorithm Privelet+ (T, λ, S_A) 
1. map T to its frequency matrix M 
2. divide M into sub-matrices along the dimensions specified in S_A 
3. for each sub-matrix 
4.     compute the HN wavelet coefficients of the sub-matrix 
5.     add to each coefficient c a Laplace noise with magnitude λ/W_HN(c) 
6.     convert the noisy coefficients back to a noisy sub-matrix 
7. assemble the noisy sub-matrices into a frequency matrix M* 
8. return M* 

Fig. 5. The Privelet+ algorithm 

D. A Hybrid Solution 

We have shown that Privelet outperforms Dwork et al.'s 
method asymptotically in terms of the accuracy of range-count 
queries. In practice, however, Privelet can be inferior to Dwork 
et al.'s method when the input table T contains attributes with 
small domains. For instance, if T has a single ordinal attribute 
A with domain size |A| = 16, then Privelet provides a noise 
variance bound of 

$$2 \cdot (2 \cdot \mathcal{P}(A)/\epsilon)^2 \cdot \mathcal{H}(A) = 600/\epsilon^2,$$

as analyzed in Section VI-C. In contrast, Dwork et al.'s method 
incurs a noise variance of at most 

$$2 \cdot (2/\epsilon)^2 \cdot |A| = 128/\epsilon^2,$$

as shown in Section III-B. This demonstrates that 
Dwork et al.'s method is more favorable for small-domain 
attributes, while Privelet is more suitable for attributes whose 
domains are large. How can we combine the advantages of 
both solutions to handle datasets that contain both large- and 
small-domain attributes? 

We answer the above question with the Privelet+ algorithm 
illustrated in Figure 5. The algorithm takes as input a table 
T, a parameter λ, and a subset S_A of the attributes in T. It first 
maps T to its frequency matrix M. Then, it divides M into 
sub-matrices, such that each sub-matrix contains the entries in 
M that have the same coordinates on each dimension specified 
in S_A. For instance, given the frequency matrix in Table II, 
if S_A contains only the "Has Diabetes?" dimension, then the 
matrix would be split into two one-dimensional sub-matrices, 
each of which contains a column in Table II. In general, if M 
has d dimensions, then each sub-matrix should have d − |S_A| 
dimensions. 
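
The division of M into sub-matrices can be sketched as follows. The representation of the frequency matrix as a dictionary from coordinate tuples to counts, the function name, and the example counts are all hypothetical (the counts are not those of Table II); the sketch only illustrates how entries are grouped by their coordinates on the dimensions in S_A.

```python
from collections import defaultdict

def split_submatrices(matrix, sa_dims):
    """Group entries by their coordinates on the dimensions listed in sa_dims.

    matrix maps d-dimensional coordinate tuples to counts; each returned
    sub-matrix keeps only the remaining d - |S_A| coordinates.
    """
    subs = defaultdict(dict)
    for coords, count in matrix.items():
        key = tuple(coords[d] for d in sa_dims)
        rest = tuple(c for d, c in enumerate(coords) if d not in sa_dims)
        subs[key][rest] = count
    return dict(subs)

# Made-up 2-D matrix: dimension 0 is an age group, dimension 1 is "Has Diabetes?".
M = {
    ("<30", "Yes"): 0, ("<30", "No"): 2,
    ("30-39", "Yes"): 1, ("30-39", "No"): 1,
    ("40-49", "Yes"): 2, ("40-49", "No"): 0,
}
subs = split_submatrices(M, sa_dims=[1])
```

Here `subs` holds two one-dimensional sub-matrices, one per value of the second dimension, and each would then be processed independently.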

After that, each sub-matrix is converted into wavelet coefficients 
using a (d − |S_A|)-dimensional HN wavelet transform. 
Privelet+ injects into each coefficient c a Laplace noise with 
magnitude λ/W_HN(c), and then maps the noisy coefficients 
back to a noisy sub-matrix. In other words, Privelet+ processes 
each sub-matrix in the same way as Privelet handles a 
(d − |S_A|)-dimensional frequency matrix. Finally, Privelet+ puts 
together all noisy sub-matrices to obtain a d-dimensional noisy 
frequency matrix M*, and then terminates by returning M*. 

TABLE III 
Sizes of Attribute Domains 

          Age    Gender    Occupation    Income 
Brazil    101    2 (2)     512 (3)       1001 
US        96     2 (2)     511 (3)       1020 

Observe that Privelet+ captures Privelet as a special case 
where S_A = ∅. Compared to Privelet, it provides the flexibility 
of not applying wavelet transforms on the attributes in S_A. 
Intuitively, this enables us to achieve better data utility by 
putting in S_A the attributes with small domains, since those 
attributes cannot be handled well with Privelet. Our intuition 
is formalized in Corollary 1, which follows from Theorems 2 
and 3. 

Corollary 1: Let T be a table that contains a set S of 
attributes. Given T, a subset S_A of S, and a parameter 
λ, Privelet+ achieves ε-differential privacy with 
ε = 2/λ · ∏_{A ∈ S−S_A} P(A). In addition, it ensures that any range-count 
query result has a noise variance at most 

$$2\lambda^2 \cdot \Big( \prod_{A \in S_A} |A| \Big) \cdot \prod_{A \in S - S_A} \mathcal{H}(A).$$ ■ 

By Corollary 1, when ε-differential privacy is enforced, 
Privelet+ leads to a noise variance bound of 

$$\frac{8}{\epsilon^2} \cdot \Big( \prod_{A \in S_A} |A| \Big) \cdot \prod_{A \in S - S_A} \Big( (\mathcal{P}(A))^2 \cdot \mathcal{H}(A) \Big). \qquad (7)$$

It is not hard to verify that, when S_A contains only attributes 
A with |A| ≤ (P(A))² · H(A), the bound given in Equation 7 
is always no worse than the noise variance bounds provided 
by Privelet and Dwork et al.'s method. 
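
The selection rule for S_A can be applied mechanically to the attribute domains reported in Table III. The sketch below is illustrative only: the function names and the attribute encoding are hypothetical, and the rule is taken to be |A| ≤ (P(A))² · H(A).

```python
import math

def p_squared_h(size, ordinal, height=None):
    # (P(A))^2 * H(A) with P and H as defined in Section VI-C
    if ordinal:
        l = math.log2(size)
        return (1 + l) ** 2 * (2 + l) / 2
    return height ** 2 * 4

def choose_sa(attrs):
    """Return the attributes whose domain is small enough to skip the transform."""
    return [name for name, (size, ordinal, height) in attrs.items()
            if size <= p_squared_h(size, ordinal, height)]

def hybrid_bound_times_eps2(attrs, sa):
    # Equation 7 multiplied by eps^2
    bound = 8.0
    for name, (size, ordinal, height) in attrs.items():
        bound *= size if name in sa else p_squared_h(size, ordinal, height)
    return bound

brazil = {
    "Age": (101, True, None),
    "Gender": (2, False, 2),
    "Occupation": (512, False, 3),
    "Income": (1001, True, None),
}
sa = choose_sa(brazil)
print(sa, hybrid_bound_times_eps2(brazil, sa))
```

For the Brazil domains the rule selects Age and Gender, and the resulting bound is no larger than the one obtained with an empty S_A.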

Finally, we note that Privelet+ also runs in O(n + m) time. 
This follows from the O(n + m) time complexity of Privelet. 

VII. Experiments 

This section experimentally evaluates Privelet+ and Dwork 
et al.'s method (referred to as Basic in the following). 
Section VII-A compares their data utility, while Section VII-B 
investigates their computational cost. 

A. Accuracy of Range-Count Queries 

We use two datasets³ that contain census records of individuals 
from Brazil and the US, respectively. The Brazil dataset 
has 10 million tuples and four attributes, namely, Age, Gender, 
Occupation, and Income. The attributes Age and Income are 
ordinal, while Gender and Occupation are nominal. The US 
dataset also contains these four attributes (but with slightly 
different domains), and it has 8 million tuples. Table III shows 
the domain sizes of the attributes in the datasets. The numbers 
enclosed in parentheses indicate the heights of the hierarchies 
associated with the nominal attributes. 

For each dataset, we create a set of 40000 random range-count 
queries, such that the number of predicates in each query 
is uniformly distributed in [1, 4]. Each query predicate "A_i ∈ 
S_i" is generated as follows. First, we choose A_i randomly 
from the attributes in the dataset. After that, if A_i is ordinal, 
then S_i is set to a random interval defined on A_i; otherwise, 
we randomly select a non-root node from the hierarchy of A_i, 
and let S_i contain all leaves in the subtree of the node. We 
define the selectivity of a query q as the fraction of tuples in 
the dataset that satisfy all predicates in q. We also define the 
coverage of q as the fraction of entries in the frequency matrix 
that are covered by q. 

³ Both datasets are publicly available as part of the Integrated Public Use 
Microdata Series [11]. 

We apply Basic and Privelet+ on each dataset to produce 
noisy frequency matrices that ensure ε-differential privacy, 
varying ε from 0.5 to 1.25. For Privelet+, we set its input 
parameter S_A = {Age, Gender}, since each A of these two 
attributes has a relatively small domain, i.e., |A| ≤ (P(A))² · 
H(A), where P and H are as defined in Section VI-C. We use 
the noisy frequency matrices to derive approximate answers 
for range-count queries. The quality of each approximate 
answer x is gauged by its square error and relative error with 
respect to the actual query result act. Specifically, the square 
error of x is defined as (x − act)², and the relative error of x is 
computed as |x − act| / max{act, s}, where s is a sanity bound 
that mitigates the effects of the queries with excessively small 
selectivities (we follow this evaluation methodology from 
previous work [12], [13]). We set s to 0.1% of the number of 
tuples in the dataset. 
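
The two error measures are straightforward to compute. The sketch below assumes hypothetical function names and made-up query answers; only the formulas and the 0.1% sanity bound come from the text.

```python
def square_error(x, act):
    return (x - act) ** 2

def relative_error(x, act, s):
    # The sanity bound s keeps tiny true answers from inflating the error.
    return abs(x - act) / max(act, s)

n = 10_000_000            # number of tuples in the Brazil dataset
s = 0.001 * n             # sanity bound: 0.1% of the number of tuples
print(square_error(5200, 5000), relative_error(5200, 5000, s))
```

For a true answer below the sanity bound, the denominator is s rather than the true answer, so the relative error in this made-up case is 200 / 10000.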

In our first set of experiments, we divide the query set Q_Br 
for the Brazil dataset into 5 subsets. All queries in the i-th 
(i ∈ [1, 5]) subset have coverage that falls between the (i − 1)-th 
and i-th quintiles of the query coverage distribution in Q_Br. 
On each noisy frequency matrix generated from the Brazil 
dataset, we process the 5 query subsets in turn, and plot in 
Figure 6 the average square error in each subset as a function 
of the average query coverage. Figure 7 shows the results of 
a similar set of experiments conducted on the US dataset. 

The average square error of Basic increases linearly with the 
query coverage, which conforms to the analysis in Section III-B 
that Basic incurs a noise variance linear in the coverage of 
the query. In contrast, the average square error of Privelet+ is 
insensitive to the query coverage. The maximum average error 
of Privelet+ is smaller than that of Basic by two orders of 
magnitude. This is consistent with our results that Privelet+ 
provides a much better noise variance bound than Basic does. 
The error of both Basic and Privelet+ decreases with the 
increase of ε, since a larger ε leads to a smaller amount of 
noise in the frequency matrices. 

In the next experiments, we divide the query set for each 
dataset into 5 subsets based on query selectivities. Specifically, 
the i-th (i ∈ [1, 5]) subset contains the queries whose selectivities 
are between the (i − 1)-th and i-th quintiles of the overall 
query selectivity distribution. Figures 8 and 9 illustrate the 
average relative error incurred by each noisy frequency matrix 
in answering each query subset. The X-axes of the figures 
represent the average selectivity of each subset of queries. 
The error of Privelet+ is consistently lower than that of Basic, 
except when the query selectivity is below 10⁻⁷. In addition, 
the error of Privelet+ is no more than 25% in all cases, while 
Basic induces more than 70% error in several query subsets. 

In summary, our experiments show that Privelet+ significantly 
outperforms Basic in terms of the accuracy of range-count 
queries. Specifically, Privelet+ incurs a smaller query 
error than Basic does, whenever the query coverage is larger 
than 1% or the query selectivity is at least 10⁻⁷. 

B. Computation Time 

Next, we investigate how the computation time of Basic and 
Privelet+ varies with the number n of tuples in the input data 
and the number m of entries in the frequency matrix. For this 
purpose, we generate synthetic datasets with various values of 
n and m. Each dataset contains two ordinal attributes and two 
nominal attributes. The domain size of each attribute is m^{1/4}. 
Each nominal attribute A has a hierarchy H with three levels, 
such that the number of level-2 nodes in H is √|A|. The 
values of the tuples are uniformly distributed in the attribute 
domains. 

In the first set of experiments, we fix m = 2²⁴, and apply 
Basic and Privelet+ on datasets with n ranging from 1 million 
to 5 million. For Privelet+, we set its input parameter S_A = ∅, 
in which case Privelet+ has a relatively large running time, 
since it needs to perform wavelet transforms on all dimensions 
of the frequency matrix. Figure 10 illustrates the computation 
time of Basic and Privelet+ as a function of n. Observe that 
both techniques scale linearly with n. 

In the second set of experiments, we set n = 5 × 10⁶, and 
vary m from 2²² to 2²⁶. Figure 11 shows the computation 
overhead of Basic and Privelet+ as a function of m. Both 
techniques run in linear time with respect to m. 

In summary, the computation time of Privelet+ is linear 
in n and m, which confirms our analysis that Privelet+ has 
an O(n + m) time complexity. Compared to Basic, Privelet+ 
incurs a higher computation overhead, but this is justified by 
the fact that Privelet+ provides much better utility for range-count 
queries. 

VIII. Related Work 

Numerous techniques have been proposed for ensuring 
ε-differential privacy in data publishing [6], [14]-[21]. The 
majority of these techniques, however, are not designed for 
the publication of general relational tables. In particular, the 
solutions by Korolova et al. [14] and Götz et al. [15] are 
developed for releasing query and click histograms from 
search logs. Chaudhuri and Monteleoni [16], Blum et al. [17], 
and Kasiviswanathan et al. [18] investigate how the results of 
various machine learning algorithms can be published. Nissim 
et al. [19] propose techniques for releasing (i) the median value 
of a set of real numbers, and (ii) the centers of the clusters 
output from the k-means clustering algorithm. Machanavajjhala 
et al. [20] study the publication of commuting patterns, 
i.e., tables with a schema (ID, Origin, Destination) where 
each tuple captures the residence and working locations of 
an individual. 

Fig. 6. Average Square Error vs. Query Coverage (Brazil) 

The work closest to ours is by Dwork et al. [6] and Barak 
et al. [21]. Dwork et al.'s method, as discussed previously, 
is outperformed by our Privelet technique in terms of the 
accuracy of range-count queries. On the other hand, Barak 
et al.'s technique is designed for releasing marginals, i.e., 
the projections of a frequency matrix on various subsets 
of the dimensions. Given a set of marginals, Barak et al.'s 
technique first transforms them into the Fourier domain, then 
adds noise to the Fourier coefficients. After that, it refines 
the noisy coefficients, and maps them back to a set of noisy 
marginals. Although this technique and Privelet have a similar 
framework, their optimization goals are drastically different. 
Specifically, Barak et al.'s technique does not provide utility 
guarantees for range-count queries; instead, it ensures that (i) 
every entry in the noisy marginals is a non-negative integer, 
and (ii) all marginals are mutually consistent, e.g., the sum of 
all entries in a marginal always equals that of another marginal. 

Fig. 10. Computation Time vs. n 
Fig. 11. Computation Time vs. m 

In addition, Barak et al.'s technique requires solving a linear 
program where the number of variables equals the number m 
of entries in the frequency matrix. This can be computationally 
challenging for practical datasets with a large m. For instance, 
for the two census datasets used in our experiments, we have 
m > 10⁸. In contrast, Privelet runs in time linear in m and 
the number n of tuples in the input table. 

Independent of our work, Hay et al. [22] propose an 
approach for achieving ε-differential privacy while ensuring 
polylogarithmic noise variance in range-count query answers. 
Given a one-dimensional frequency matrix M, Hay et al.'s 
approach first computes the results of a set of range-count 
queries on M, and then adds Laplace noise to the results. 
After that, it derives a noisy frequency matrix M* based on 
the noisy query answers, during which it carefully exploits 
the correlations among the answers to reduce the amount of 
noise in M*. Although Hay et al.'s approach and Privelet 
provide comparable utility guarantees, the former is designed 
exclusively for one-dimensional datasets, whereas the latter is 
applicable to datasets with arbitrary dimensionalities. 

There also exists a large body of literature (e.g., [8], 
[12], [13]) on the application of wavelet transforms in data 
management. The focus of this line of research, however, is 
not on privacy preservation. Instead, existing work mainly 
investigates how wavelet transforms can be used to construct 
space- and time-efficient representations of multi-dimensional 
data, so as to facilitate query optimization [12], or approximate 
query processing [8], [13], just to name two applications. 

IX. Conclusions 

We have presented Privelet, a data publishing technique 
that utilizes wavelet transforms to ensure ε-differential privacy. 
Compared to the existing solutions, Privelet provides 
significantly improved theoretical guarantees on the accuracy 
of range-count queries. Our experimental evaluation demonstrates 
the effectiveness and efficiency of Privelet. 

For future work, we plan to extend Privelet for the case 
where the distribution of range-count queries is known in 
advance. Furthermore, currently Privelet only provides bounds 
on the noise variance in the query results; we want to investigate 
what guarantees Privelet may offer for other utility 
metrics, such as the expected relative error of the query 
answers. 

Appendix 

A. Proof of Lemma 1 

Let T_1 and T_2 be any two tables that differ in only one 
tuple, and M_1 and M_2 be the frequency matrices of T_1 and T_2, 
respectively. Let T_3 = T_1 ∩ T_2, and M_3 be the frequency matrix 
of T_3. Observe that M_1 and M_3 differ in only one entry, and 
the entry's value in M_1 differs from its value in M_3 by one. 
Since F has a generalized sensitivity ρ with respect to W, 

$$\sum_{f \in F} \big( W(f) \cdot |f(M_1) - f(M_3)| \big) \le \rho \cdot \|M_1 - M_3\|_1 = \rho.$$

Similarly, we have 

$$\sum_{f \in F} \big( W(f) \cdot |f(M_2) - f(M_3)| \big) \le \rho \cdot \|M_2 - M_3\|_1 = \rho.$$

Let f_i (i ∈ [1, |F|]) be the i-th query in F, and x_i be an 
arbitrary real number. We have 

$$\frac{\Pr\{Q(T_2) = (x_1, x_2, \ldots, x_{|F|})\}}{\Pr\{Q(T_1) = (x_1, x_2, \ldots, x_{|F|})\}}
= \frac{\prod_{i=1}^{|F|} \exp\big( -W(f_i) \cdot |x_i - f_i(M_2)| / \lambda \big)}{\prod_{i=1}^{|F|} \exp\big( -W(f_i) \cdot |x_i - f_i(M_1)| / \lambda \big)}$$

$$\le \prod_{i=1}^{|F|} \exp\big( W(f_i) \cdot |f_i(M_1) - f_i(M_2)| / \lambda \big)$$

$$\le \prod_{i=1}^{|F|} \exp\big( W(f_i) \cdot |f_i(M_1) - f_i(M_3)| / \lambda + W(f_i) \cdot |f_i(M_2) - f_i(M_3)| / \lambda \big)$$

$$\le e^{2\rho/\lambda},$$

which completes the proof. 

B. Proof of Lemma 3 

Let R be the decomposition tree of M*. Recall that each 
entry v in M* can be expressed as a weighted sum (see 
Equation 3) of the base coefficient c_0 ∈ C and the ancestors 
of v in R. In particular, the base coefficient has a weight 1 in 
the sum. On the other hand, an ancestor c of v has a weight 
1 (−1) in the sum, if v is in the left (right) subtree of c. 
Therefore, for any one-dimensional range-count query with a 
predicate "A_1 ∈ S_1", its answer on M* can be formulated as 
a weighted sum y of the wavelet coefficients as follows: 

$$y = |S_1| \cdot c_0 + \sum_{c \in C \setminus \{c_0\}} c \cdot \big( \alpha(c) - \beta(c) \big),$$

where α(c) (β(c)) denotes the number of leaves in the left 
(right) subtree of c that are contained in S_1. 

For any coefficient c, if none of the leaves under c is 
contained in S_1, we have α(c) = β(c) = 0. On the other 
hand, if all leaves under c are covered by S_1, then α(c) = 
β(c) = 2^{l − level(c)}, where level(c) denotes the level of c in 
R. Therefore, α(c) − β(c) ≠ 0 only if the left or 
right subtree of c partially intersects S_1. At any level of the 
decomposition tree R, there exist at most two such coefficients, 
since S_1 is an interval defined on A_1. 

Let l = log₂ |M*|. Consider a coefficient c at level i 
(i ∈ [1, l]) of R, such that α(c) − β(c) ≠ 0. Since the left 
(right) subtree of c contains at most 2^{l−i} leaves, we have 
α(c), β(c) ∈ [0, 2^{l−i}]. Therefore, |α(c) − β(c)| ≤ 2^{l−i}. Recall 
that W_Haar(c) = 2^{l−i+1}; therefore, the noise in c has a 
variance at most σ²/4^{l−i+1}. In that case, the noise contributed 
by c to y has a variance at most 

$$\big( \alpha(c) - \beta(c) \big)^2 \cdot \sigma^2 / 4^{\,l-i+1} \le \big( 2^{\,l-i} \big)^2 \cdot \sigma^2 / 4^{\,l-i+1} = \sigma^2/4.$$

On the other hand, the noise in the base coefficient c_0 has a 
variance at most (σ/|M*|)². Therefore, the noise contributed 
by c_0 to y has a variance at most |S_1|² · (σ/|M*|)², which is 
no more than σ². 

In summary, the variance of noise in y is at most 

$$\sigma^2 + 2 \cdot l \cdot \sigma^2/4 = (2 + \log_2 |M^*|)/2 \cdot \sigma^2,$$

which completes the proof. 
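
The decomposition of a range-count answer used in this proof can be checked on a small example. The sketch below makes several assumptions: the function names are hypothetical, the input length is a power of two, coefficients are stored in level order, and the Haar convention is the one under which each entry equals the base coefficient plus or minus each of its ancestors.

```python
def haar_1d(v):
    """Haar coefficients in level order, base coefficient (the mean) first."""
    averages, levels = list(v), []
    while len(averages) > 1:
        pairs = [(averages[i], averages[i + 1]) for i in range(0, len(averages), 2)]
        levels.append([(a - b) / 2 for a, b in pairs])
        averages = [(a + b) / 2 for a, b in pairs]
    coeffs = [averages[0]]
    for level in reversed(levels):
        coeffs.extend(level)
    return coeffs

def overlap(lo, hi, a, b):
    # Number of integers shared by the inclusive intervals [lo, hi] and [a, b].
    return max(0, min(hi, b) - max(lo, a) + 1)

def range_count(coeffs, lo, hi):
    """Answer sum(v[lo..hi]) as |S| * c0 + sum of c * (alpha(c) - beta(c))."""
    m = len(coeffs)
    y = (hi - lo + 1) * coeffs[0]
    nonzero_per_level = []
    pos, nodes = 1, 1
    while pos < m:
        span = m // nodes          # leaves under each coefficient at this level
        half = span // 2
        count = 0
        for k in range(nodes):
            start = k * span
            alpha = overlap(lo, hi, start, start + half - 1)
            beta = overlap(lo, hi, start + half, start + span - 1)
            y += coeffs[pos + k] * (alpha - beta)
            count += alpha != beta
        nonzero_per_level.append(count)
        pos += nodes
        nodes *= 2
    return y, nonzero_per_level

v = [3, 1, 4, 1, 5, 9, 2, 6]
y, counts = range_count(haar_1d(v), 2, 6)
```

The list `counts` records, for each level, how many coefficients have α(c) ≠ β(c); for an interval query it never exceeds two, which is the fact that yields the factor 2 · l in the bound.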

C. Proof of Lemma \5\ 

We will prove the lemma in two steps: The first step 
analyzes the variance of noise in each coefficient in C*; the 
second step shows that the result of any range-count query 
can be expressed as a weighted sum of the coefficients in C*, 
such that the variance of the sum is less than a 2 . 

Let $C$ be the set of nominal wavelet coefficients of the input
frequency matrix $M$. Let $G$ be any sibling group in $C$, and
$G'$ ($G^*$) be the corresponding group in $C'$ ($C^*$). By the way
each coefficient in $C$ is computed, $\sum_{g \in G} g = 0$. Let $g_i$, $g'_i$,
and $g^*_i$ denote the $i$-th ($i \in [1, |G|]$) coefficient in $G$, $G'$, and
$G^*$, respectively. Let $\eta_i = g'_i - g_i$ be the noise in $g'_i$. We have

$$g^*_i = g'_i - \frac{1}{|G|} \sum_{j=1}^{|G|} g'_j = g_i + \eta_i - \frac{1}{|G|} \sum_{j=1}^{|G|} (g_j + \eta_j) = g_i + \left(1 - \frac{1}{|G|}\right) \cdot \eta_i - \frac{1}{|G|} \sum_{j \in [1, |G|],\, j \neq i} \eta_j. \quad (8)$$



Recall that $W_{Nom}(g_i) = 1/(2 - 2/|G|)$. Hence, $\eta_i$ has a
variance at most $4(1 - 1/|G|)^2 \cdot \sigma^2$. By Equation (8), the noise
in $g^*_i$ has a variance at most

$$\left((1 - 1/|G|)^2 + (1/|G|)^2 \cdot (|G| - 1)\right) \cdot 4(1 - 1/|G|)^2 \cdot \sigma^2 = 4(1 - 1/|G|)^3 \cdot \sigma^2. \quad (9)$$



Similarly, it can be proved that for each non-base coefficient
$c^*$ in $C^*$, the variance of the noise in $c^*$ is at most
$4 \cdot (1 - 1/f)^3 \cdot \sigma^2$, where $f$ is the fanout of $c^*$'s parent in the
decomposition tree $R$. On the other hand, the base coefficient
$c^*_0$ in $C^*$ has a noise variance at most $\sigma^2$, since it is identical
to the base coefficient in $C'$.
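
The simplification behind the per-coefficient bound after mean subtraction can be checked with exact arithmetic. The sketch below is illustrative: it assumes the noise terms within a sibling group are independent and share the variance bound derived from W_Nom, and the function name is hypothetical.

```python
from fractions import Fraction


def mean_subtracted_noise_variance(group_size, sigma2=Fraction(1)):
    """Variance bound for the noise in one coefficient after mean subtraction."""
    g = Fraction(group_size)
    # Bound on the variance of each noise term, from W_Nom = 1 / (2 - 2/|G|).
    eta_var = 4 * (1 - 1 / g) ** 2 * sigma2
    # Coefficients of the |G| noise terms in the mean-subtracted coefficient:
    # (1 - 1/|G|) for its own noise, -1/|G| for each sibling's noise.
    row = [1 - 1 / g] + [-1 / g] * (group_size - 1)
    # Independent noise terms: variances add, scaled by the squared coefficients.
    return sum(w * w for w in row) * eta_var
```

For any group size of at least two, the returned value equals `4 * (1 - 1/|G|) ** 3` times `sigma2`, which is the cubic form of the bound.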

Now consider any range-count query $q$, such that the
predicate in $q$ corresponds to a certain node $N$ in the hierarchy
$H$ associated with $M^*$. Let $h$ be the height of $H$. Given $M^*$,
we can answer $q$ by summing up the set $S$ of entries that are
in the subtree of $N$ in $H$. If $N$ is a leaf in $H$, then $S$ should
contain only the entry $v^* \in M^*$ that corresponds to $N$. By
Equation (5),

$$v^* = c^*_{h-1} + \sum_{i=0}^{h-2} \left( c^*_i \cdot \prod_{j=i}^{h-2} \frac{1}{f_j} \right), \quad (10)$$

where $c^*_i \in C^*$ is the ancestor of $v^*$ at the $(i+1)$-th level of the
decomposition tree, and $f_i$ is the fanout of $c^*_i$. As discussed
above, $c^*_0$ has a noise variance at most $\sigma^2$, while $c^*_i$
($i \in [1, h-1]$) has a noise variance at most $4 \cdot (1 - 1/f_{i-1})^3 \cdot \sigma^2$.
By Equation (10) and the fact that $f_i > 1$, it can be verified that
the variance of the noise in $v^*$ is less than $4\sigma^2$.
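
The leaf case can be explored numerically. The sketch below is an illustration under stated assumptions: it takes the reconstruction of a leaf entry to be its lowest ancestor coefficient plus each higher ancestor divided by the product of the fanouts from that ancestor down to the entry's grandparent, it treats the noise in coefficients on one root-to-leaf path as independent, and it uses the variance bounds quoted above. The function names are hypothetical.

```python
from itertools import product


def leaf_noise_variance(fanouts, sigma2=1.0):
    """Variance bound for one reconstructed entry.

    `fanouts` lists f_0, ..., f_{h-2}, the fanouts of the ancestors
    from the base coefficient downwards.
    """
    h = len(fanouts) + 1
    # Bounds for the ancestors: sigma^2 for the base coefficient,
    # 4 * (1 - 1/f_{i-1})^3 * sigma^2 for the ancestor at level i + 1.
    coeff_var = [sigma2] + [
        4 * (1 - 1 / fanouts[i - 1]) ** 3 * sigma2 for i in range(1, h)
    ]
    total = coeff_var[h - 1]  # the lowest ancestor enters with weight 1
    for i in range(h - 1):
        weight = 1.0
        for f in fanouts[i:]:
            weight /= f  # product of 1/f_j for j = i, ..., h-2
        total += weight ** 2 * coeff_var[i]
    return total


def worst_leaf_variance(max_depth, choices=(2, 3, 5, 10)):
    """Largest bound over all fanout sequences of length at most max_depth."""
    return max(
        leaf_noise_variance(list(fanouts))
        for depth in range(max_depth + 1)
        for fanouts in product(choices, repeat=depth)
    )
```

With `sigma2 = 1.0`, `worst_leaf_variance(4)` stays below 4 for the fanouts tried here, in line with the stated bound.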

On the other hand, if $N$ is a level-$(h-1)$ node in $H$, then
$S$ should contain all entries in $M^*$ that are children of $N$ in
$H$. Observe that each of these entries has a distinct parent in
the decomposition tree $R$, but they have the same ancestors at
levels 1 to $h-1$ of $R$. Let $c^*_i$ be the ancestor of these entries
at level $i+1$, and $f_i$ be the fanout of $c^*_i$. Let $X$ be the set of
wavelet coefficients in $R$ that are the parents of the entries in
$S$. Then, $X$ should be a sibling group in $C^*$, and $|X| = f_{h-2}$.
In addition, we have $\sum_{c^* \in X} c^* = 0$, as ensured by the mean
subtraction procedure. In that case, by Equation (5), the sum
of the entries in $S$ equals

$$\sum_{v^* \in S} v^* = c^*_{h-2} + \sum_{i=0}^{h-3} \left( c^*_i \cdot \prod_{j=i}^{h-3} \frac{1}{f_j} \right).$$

Taking into account the noise variance in each $c^*_i$ and the
fact that $f_i > 1$, we can show that the variance of noise in
$\sum_{v^* \in S} v^*$ is also less than $4\sigma^2$.

In general, we can prove by induction that, when $N$ is a
level-$k$ ($k \in [1, h-2]$) node in $H$, the answer for $q$ equals

$$c^*_{k-1} + \sum_{i=0}^{k-2} \left( c^*_i \cdot \prod_{j=i}^{k-2} \frac{1}{f_j} \right), \quad (11)$$

where $c^*_{k-1}$ is a wavelet coefficient at level $k$ of the
decomposition tree $R$, $c^*_i$ ($i \in [0, k-2]$) is the ancestor of $c^*_{k-1}$ at
level $i+1$, and $f_i$ is the fanout of $c^*_i$. Based on Equation (11),
it can be shown that the noise variance in the answer for $q$ is
less than $4\sigma^2$, which completes the proof.

D. Proof of Theorem 2

Let $M'$ be any matrix that can be obtained by changing a
certain entry $v$ in the input matrix $M$. Let $\delta = \|M - M'\|_1$,
and $C_i$ ($C'_i$) be the step-$i$ matrix in the HN wavelet transform
on $M$ ($M'$). For any $j \in [1, m]$, let $C_i(j)$ and $C'_i(j)$ denote
the $j$-th entry in $C_i$ and $C'_i$, respectively. By the definition
of generalized sensitivity, Theorem 2 holds if and only if the
following inequality is valid:

$$\sum_{j=1}^{m} \left( W_{HN}(C_d(j)) \cdot |C_d(j) - C'_d(j)| \right) \le \delta \cdot \prod_{i=1}^{d} \mathcal{P}(A_i). \quad (12)$$

To establish Equation (12), it suffices to prove that the following
inequality holds for any $k \in [1, d]$:

$$\sum_{j=1}^{m} \left( W_{HN}(C_k(j)) \cdot |C_k(j) - C'_k(j)| \right) \le \delta \cdot \prod_{i=1}^{k} \mathcal{P}(A_i). \quad (13)$$

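Our proof for Equation (13) is based on an induction on $k$. For
the base case when $k = 1$, Equation (13) directly follows from
Lemmas 2 and 4. Assume that Equation (13) holds for some
$k = l \in [1, d-1]$. We will show that the case when $k = l + 1$
also holds.

Consider that we transform $C_l$ into $C'_l$ by replacing the
entries in $C_l$ with the entries in $C'_l$ one by one. The replacement
of each entry in $C_l$ would affect some coefficients in $C_{l+1}$.
Let $C^{(a)}_{l+1}$ denote the modified version of $C_{l+1}$ after the first
$a$ entries in $C_l$ are replaced. By Lemmas 2 and 4, and by the
way we assign weights to the coefficients in $C_{l+1}$,

$$\sum_{j=1}^{m} \left( \frac{W_{HN}(C_{l+1}(j))}{W_{HN}(C_l(a))} \cdot \left| C^{(a)}_{l+1}(j) - C^{(a-1)}_{l+1}(j) \right| \right) \le \mathcal{P}(A_{l+1}) \cdot |C_l(a) - C'_l(a)|.$$

Multiplying both sides by $W_{HN}(C_l(a))$, summing over $a$, and
applying the triangle inequality leads to

$$\mathcal{P}(A_{l+1}) \cdot \sum_{a=1}^{m} \left( W_{HN}(C_l(a)) \cdot |C_l(a) - C'_l(a)| \right) \ge \sum_{a=1}^{m} \sum_{j=1}^{m} \left( W_{HN}(C_{l+1}(j)) \cdot \left| C^{(a)}_{l+1}(j) - C^{(a-1)}_{l+1}(j) \right| \right) \ge \sum_{j=1}^{m} \left( W_{HN}(C_{l+1}(j)) \cdot |C'_{l+1}(j) - C_{l+1}(j)| \right). \quad (14)$$

By Equation (14) and the induction hypothesis, Equation (13)
holds for $k = l + 1$, which completes the proof.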
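
The base case of the induction can be illustrated for a single ordinal dimension. The Python sketch below is an assumption-laden illustration rather than the procedure from the paper: it uses an averaging and differencing Haar transform (each detail coefficient is half the difference between the averages of its two subtrees), a weight of m for the base coefficient, and a weight of 2^(l-i+1) for a level-i coefficient. All function names are hypothetical.

```python
from fractions import Fraction


def haar_1d(values):
    """Haar transform of a power-of-two-length list: [base, details coarse to fine]."""
    current = [Fraction(v) for v in values]
    details = []
    while len(current) > 1:
        pairs = list(zip(current[0::2], current[1::2]))
        # Finer details are computed first, so coarser ones are prepended.
        details = [(a - b) / 2 for a, b in pairs] + details
        current = [(a + b) / 2 for a, b in pairs]
    return current + details


def haar_weights(m):
    """Weights in the same order as the output of haar_1d."""
    l = m.bit_length() - 1
    weights = [Fraction(m)]  # base coefficient
    for level in range(1, l + 1):
        weights += [Fraction(2 ** (l - level + 1))] * (2 ** (level - 1))
    return weights


def weighted_l1_change(values, index, delta):
    """Weighted L1 distance between the coefficients before and after one change."""
    changed = list(values)
    changed[index] += delta
    before, after = haar_1d(values), haar_1d(changed)
    weights = haar_weights(len(values))
    return sum(w * abs(a - b) for w, a, b in zip(weights, after, before))
```

Changing one of eight entries by `delta` alters the base coefficient and one coefficient per level, and each contributes `delta` to the weighted sum, so the total in this sketch is `(1 + log2(8)) * delta`.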

E. Proof of Theorem 3

Our proof of Theorem 3 utilizes the following proposition.

Proposition 1: Let $M$, $M'$, and $M''$ be three matrices that
have the same set of dimensions. Let $M_d$, $M'_d$, and $M''_d$ be the HN
wavelet coefficient matrices of $M$, $M'$, and $M''$, respectively.
If $M + M' = M''$, then $M_d + M'_d = M''_d$.

Proof: Observe that both the Haar and nominal wavelet
transforms are linear transformations, since each wavelet
coefficient they produce is a linear combination of the entries in
the input matrix. Consequently, the HN wavelet transform, as
a composition of the Haar and nominal wavelet transforms,
is also a linear transformation. Therefore, $M + M' = M''$
implies $M_d + M'_d = M''_d$. ■
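
Linearity is easy to observe on a one-dimensional example. The sketch below is illustrative only: `haar_1d` is a hypothetical averaging and differencing Haar transform for inputs whose length is a power of two, and exact fractions are used so the comparison involves no rounding.

```python
from fractions import Fraction


def haar_1d(values):
    """Haar transform of a power-of-two-length list: [base, details coarse to fine]."""
    current = [Fraction(v) for v in values]
    details = []
    while len(current) > 1:
        pairs = list(zip(current[0::2], current[1::2]))
        # Each detail is half the difference of two adjacent averages.
        details = [(a - b) / 2 for a, b in pairs] + details
        current = [(a + b) / 2 for a, b in pairs]
    return current + details


def transform_of_sum(x, y):
    """Transform applied to the entry-wise sum of two inputs."""
    return haar_1d([a + b for a, b in zip(x, y)])


def sum_of_transforms(x, y):
    """Entry-wise sum of the two individual transforms."""
    return [a + b for a, b in zip(haar_1d(x), haar_1d(y))]
```

For any two inputs of the same length, `transform_of_sum(x, y)` and `sum_of_transforms(x, y)` agree, which is the one-dimensional instance of the proposition.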

In the following, we will prove Theorem 3 by an induction
on $d$. For the base case when $d = 1$, the theorem follows
directly from Lemmas 3 and 5. Assume that the theorem also
holds for some $d = k \ge 1$. We will prove that the case for
$d = k + 1$ also holds.

Let $C^*_k$ be the step-$k$ matrix reconstructed from $C^*_{k+1}$
by applying inverse wavelet transform along the $(k+1)$-th
dimension of $C^*_{k+1}$. We divide $C^*_k$ into $|A_{k+1}|$ sub-matrices,
such that each matrix $C^*_k[a]$ contains all entries in $C^*_k$ whose
last coordinate equals $a \in A_{k+1}$. Observe that each $C^*_k[a]$ can
be regarded as a $k$-dimensional HN wavelet coefficient matrix,
and can be used to reconstruct a noisy frequency matrix $M^*[a]$
with $k$ dimensions $A_1, \ldots, A_k$.

Consider any range-count query $q$ on $M^*$ with a predicate
"$A_i \in S_i$" on $A_i$ ($i \in [1, k+1]$). Let us define a query $q'$ on
each $M^*[a]$, such that $q'$ has a predicate "$A_i \in S_i$" for any
$i \in [1, k]$. Let $q(M^*)$ be the result of $q$ on $M^*$, and $q'(M^*[a])$
be the result of $q'$ on $M^*[a]$. Let $M' = \sum_{a \in S_{k+1}} M^*[a]$. It
can be verified that

$$q(M^*) = \sum_{a \in S_{k+1}} q'(M^*[a]) = q'(M'). \quad (15)$$

Therefore, the theorem can be proved by showing that the
noise in $q'(M')$ has a variance at most $\sigma^2 \cdot \prod_{i=1}^{k+1} \mathcal{H}(A_i)$. For
this purpose, we will first analyze the noise contained in
the HN wavelet coefficient matrix $C'_k$ of $M'$.
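
The decomposition of a query along the last dimension can be seen on a small two-dimensional example. The sketch below is illustrative: the matrix values are made up, the first dimension indexes rows, the last dimension indexes columns, and the helper names are hypothetical.

```python
def range_count_2d(matrix, rows, cols):
    """q: sum of the entries whose row is in `rows` and whose column is in `cols`."""
    return sum(matrix[x][a] for x in rows for a in cols)


def slice_last_dim(matrix, a):
    """The sub-matrix whose last coordinate equals a (here a single column)."""
    return [row[a] for row in matrix]


def range_count_1d(vector, rows):
    """q': the same query restricted to the first dimension."""
    return sum(vector[x] for x in rows)


def merged_slices(matrix, cols):
    """M': entry-wise sum of the slices selected by the last predicate."""
    return [sum(slice_last_dim(matrix, a)[x] for a in cols) for x in range(len(matrix))]


# A made-up noisy frequency matrix with 3 rows and 4 columns.
M_star = [[3, -1, 4, 1], [5, 9, -2, 6], [5, 3, 5, 8]]
S1, S2 = range(0, 2), range(1, 4)

direct = range_count_2d(M_star, S1, S2)
per_slice = sum(range_count_1d(slice_last_dim(M_star, a), S1) for a in S2)
via_merged = range_count_1d(merged_slices(M_star, S2), S1)
```

Here `direct`, `per_slice`, and `via_merged` hold the same value, mirroring the three expressions in the equality chain.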

By Proposition 1 and the definition of $M'$, we have
$C'_k = \sum_{a \in S_{k+1}} C^*_k[a]$. Let $c'$ be an arbitrary coefficient in $C'_k$ with
a coordinate $x_i$ on the $i$-th dimension ($i \in [1, k]$). Let $V^*_k$
($V^*_{k+1}$) be a vector that contains all coefficients in $C^*_k$ ($C^*_{k+1}$)
whose coordinates on the first $k$ dimensions are identical to those of $c'$.
Then, $c'$ can be regarded as the result of a range-count query
on $V^*_k$ as follows:

SELECT COUNT(*) FROM $V^*_k$
WHERE $A_{k+1} \in S_{k+1}$

Observe that $V^*_k$ can be reconstructed from $V^*_{k+1}$ by applying
inverse wavelet transform. Since each coefficient $c^* \in C^*_{k+1}$
has a noise variance $(\sigma / W_{HN}(c^*))^2$, by Lemmas 3 and 5, the
result of any range-count query on $V^*_k$ should have a noise
variance at most

$$\mathcal{H}(A_{k+1}) \cdot \left( \sigma \cdot \frac{W_{k+1}(c^*)}{W_{HN}(c^*)} \right)^2,$$

where $W_{k+1}$ is the weight function associated with the
one-dimensional wavelet transform used to convert $C_k$ into $C_{k+1}$.
It can be verified that

$$W_{k+1}(c^*) / W_{HN}(c^*) = 1 / W_{HN}(c').$$

Therefore, the noise in $c'$ has a variance at most
$\mathcal{H}(A_{k+1}) \cdot (\sigma / W_{HN}(c'))^2$.

In summary, if we divide the coefficients in $C^*_{k+1}$ into a set
$U$ of vectors along the $(k+1)$-th dimension, then each coefficient
$c' \in C'_k$ can be expressed as a weighted sum of the coefficients
in a distinct vector in $U$. In addition, the weighted sum
has a noise with a variance at most $\mathcal{H}(A_{k+1}) \cdot (\sigma / W_{HN}(c'))^2$.
Furthermore, the noise in different coefficients in $C'_k$ is
independent, because (i) the noise in different vectors in $U$ is
independent, and (ii) no two coefficients in $C'_k$ correspond to
the same vector in $U$. Therefore, by the induction hypothesis,
for any range-count query on the $k$-dimensional frequency
matrix $M'$ reconstructed from $C'_k$, the query result has a
noise with a variance no more than $\sigma^2 \cdot \prod_{i=1}^{k+1} \mathcal{H}(A_i)$. This, by
Equation (15), shows that the theorem also holds for $d = k + 1$,
which completes the proof.



